Patent Application 

The present invention relates to the subject matter claimed 
and hence refers to a method and a device for compiling pro- 
grams for a reconfigurable device. 

Reconfigurable devices are well known. They include systolic 
arrays, neuronal networks, multiprocessor systems, processors 
comprising a plurality of ALUs and/or logic cells, crossbar 
switches, as well as FPGAs, DPGAs, XPUTERs, etc. Reference is 
being made to DE 44 16 881 A1, DE 197 81 412 A1, DE 197 81 483 
A1, DE 196 54 846 A1, DE 196 54 593 A1, DE 197 04 044.6 A1, 
DE 199 80 129 A1, DE 198 61 088 A1, DE 199 80 312 A1, 
PCT/DE 00/01869, DE 100 36 627 A1, DE 100 28 397 A1, 
DE 101 10 530 A1, DE 101 11 014 A1, PCT/EP 00/10516, 
EP 01 102 674 A1, DE 198 80 128 A1, DE 101 39 170 A1, 
DE 198 09 640 A1, DE 199 26 538.0 A1, DE 100 50 442 A1, the 
full disclosure of which is incorporated herein for purposes 
of reference. 

Furthermore, reference is being made to devices and methods as 
known from US PS 6,311,200; US PS 6,298,472; US PS 6,288,566; 
US PS 6,282,627; US PS 6,243,808 issued to Chameleon Systems 
Inc., USA, noting that the disclosure of the present application 

The invention will now be described by the following papers, 
which are part of the present application. 

1 Introduction 

This document describes the PACT Vectorizing C Compiler XPP-VC which maps a C subset extended 
by port access functions to PACT's Native Mapping Language NML. A future extension of this compiler 
for a host-XPP hybrid system is described in Section 7.3. 

XPP-VC uses the public domain SUIF compiler system. For installation instructions on both SUIF and 
XPP-VC, refer to the separately available installation notes. 



2 General Approach 

The XPP-VC implementation is based on the public domain SUIF compiler framework (cf. 
http://suif.stanford.edu). SUIF was chosen because it is easily extensible. 

SUIF was extended with two passes: partition and nmlgen. The first pass, partition, tests if 
the program complies with the restrictions of the compiler (cf. Section 3.1) and performs a dependence 
analysis. It determines if a FOR-loop can be vectorized and annotates the syntax tree accordingly. In 
XPP-VC, vectorization means that loop iterations are overlapped and executed in a pipelined, parallel 
fashion. This technique is based on the Pipeline Vectorization method developed for reconfigurable 
architectures (see footnote 1). partition also completely unrolls inner program FOR-loops which are annotated by 
the user. All innermost loops (after unrolling) which can be vectorized are selected and annotated for 
pipeline synthesis. 

nmlgen generates a control/dataflow graph for the program as follows. First, program data is allocated 
on the XPP Core. By default, nmlgen maps each program array to internal RAM blocks while scalar 
variables are stored in registers within the PAEs. If instructed by a pragma directive (cf. Section 3.2.2), 
arrays are mapped to external RAM. If it is large enough, an external RAM can hold several arrays. 

Next, one ALU is allocated for each operator in the program (after loop unrolling, if applicable). The 
ALUs are connected according to the data-flow of the program. This data-driven execution of the op- 
erators automatically yields some instruction-level parallelism within a basic block of the program, but 
the basic blocks are normally executed in their original, sequential order, controlled by event signals. 
However, for generating more efficient XPP Core configurations, nmlgen generates pipelined operator 
networks for inner program loops which have been annotated for vectorization by partition. In 
other words, subsequent loop iterations are started before previous iterations have finished. Data packets 
flow continuously through the operator pipelines. By applying pipeline balancing techniques, maximum 
throughput is achieved. For many programs, additional performance gains are achieved by the complete 
loop unrolling transformation. Though unrolled loops require more XPP resources because individual 
PAEs are allocated for each loop iteration, they yield more parallelism and better exploitation of the XPP 
Core. 

Finally, nmlgen outputs a self-contained NML file containing a module which implements the program 
on an XPP Core. The XPP IP parameters for the generated NML file are read from a configuration file, 
cf. Section 4. Thus the parameters can be easily changed. Obviously, large programs may produce NML 
files which cannot be placed and routed on a given XPP Core. Later XPP-VC releases will perform a 
temporal partitioning of C programs in order to overcome this limitation, cf. Section 7.1. 

Footnote 1: Cf. M. Weinhardt and W. Luk: Pipeline Vectorization, IEEE Transactions on Computer-Aided Design of Integrated Circuits 
and Systems, Feb. 2001, pp. 234-248. 
3 Language Coverage 

This section describes which C files can currently be handled by XPP-VC. 

3.1 Restrictions 
3.1.1 XPP Restrictions 

The following C language operations cannot be mapped to an XPP Core at all. They are not allowed in 
XPP-VC programs and need to be mapped to the host processor in a codesign compiler, cf. Section 7.3. 

• Operating system calls, including I/O 

• Division, modulo, non-constant shift and floating point operations (unless the XPP Core's ALU 
supports them) (see footnote 2) 

• The size of arrays mapped to internal RAMs is limited by the number and size of internal RAM 
blocks. 

3.1.2 XPP-VC Compiler Restrictions 

The current XPP-VC implementation necessitates the following restrictions: 

1. No multi-dimensional constant arrays (due to the SUIF version currently used) 

2. No switch/case statements 

3. No struct data types 

4. No function calls except the XPP port and pragma functions defined in Section 3.2.1. The program 
must only have one function (main). 

5. No pointer operations 

6. No library calls or recursive calls 

7. No irregular control flow (break, continue, goto, label) 

Additionally, there are currently some implementation-dependent restrictions for vectorized loops, cf. 
the Release Notes. The compiler produces an explanatory message if an inner loop cannot be pipelined 
despite the absence of dependences. However, for many of these cases, simple workarounds by minor 
program changes are available. Furthermore, programs which are too large for one configuration cannot 
be handled. They should be split into several configurations and sequenced onto the XPP Core, using 
NML's reconfiguration commands. This will be performed automatically in later releases by temporal 
partitioning, cf. Section 7.1. 

Footnote 2: In future XPP-VC releases, an alternative, sequential implementation of these operations by NML macros will be available. 
3.2 XPP-VC C Language Extensions 

We now describe the language extensions used by XPP-VC. In order to use these extensions, the C program 
must contain the following line: 

#include "XPP.h" 

This header file, XPP.h, defines the port functions described below as well as the pragma function 
XPP_unroll(). If XPP_unroll() directly precedes a FOR loop, the loop will be completely unrolled 
by partition, cf. Section 6.2. 

3.2.1 XPP Port Functions 

Since the normal C I/O functions cannot be used on an XPP Core, a method to access the XPP I/O units 
in port mode is provided. XPP.h contains the definition of the following two functions: 

XPP_getstream(int ionum, int portnum, int *value) 

XPP_putstream(int ionum, int portnum, int value) 

ionum refers to an I/O unit (1..4), and portnum to the port used in this I/O unit (0 or 1). For the 
duration of the execution of a program, an I/O unit may only be used either for port accesses or for 
RAM accesses (see below). If an I/O unit is used in port mode, each portnum can only be used either 
for read or for write accesses during the entire program execution. In the access functions, value is 
the data received from or written to the stream. Note that XPP_getstream can currently only read 
values into scalar variables (not directly into array elements!), whereas XPP_putstream can handle 
any expressions. An example program using these functions is presented in Section 6.1. 
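
The port rules above can be illustrated by a small host-side checker. This is only a sketch: check_port_access, port_dir and port_use are hypothetical names that are not part of XPP.h, and the sketch covers only the range of ionum and portnum and the rule that a port keeps one direction for the whole program. It does not model the exclusive choice between port mode and RAM mode of an I/O unit.

```c
/* Direction in which a port has been used so far. */
enum port_dir { PORT_UNUSED = 0, PORT_READ, PORT_WRITE };

/* Usage table: 4 I/O units (ionum 1..4) with 2 ports each (portnum 0..1). */
static enum port_dir port_use[4][2];

/* Record one port access and check it against the port rules.
 * Returns 0 if the access is allowed, -1 otherwise. */
int check_port_access(int ionum, int portnum, enum port_dir dir)
{
    if (ionum < 1 || ionum > 4 || portnum < 0 || portnum > 1)
        return -1;                       /* no such I/O unit or port */
    if (dir != PORT_READ && dir != PORT_WRITE)
        return -1;                       /* not a valid access direction */
    enum port_dir *slot = &port_use[ionum - 1][portnum];
    if (*slot != PORT_UNUSED && *slot != dir)
        return -1;                       /* port already used the other way */
    *slot = dir;
    return 0;
}
```

For the program in Section 6.1, the two accesses would correspond to check_port_access(1, 0, PORT_READ) and check_port_access(4, 0, PORT_WRITE), both of which are accepted.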



3.2.2 pragma Directives 

Arrays can be allocated to external memory by a compiler directive: 
#pragma extern <var> <RAM_number> 

Example: #pragma extern x 1 maps array x to external memory bank 1. 
Note the following: 

• <var> must be defined before it is used in the pragma. 

• Bank <RAM_number> must be declared in the file xppvc_options, cf. Section 4. 

• If two arrays are allocated to the same external RAM bank, they are arranged in the order of 
appearance of their respective pragma directives. The resulting offsets are recorded in file.itf, cf. 
Section 5.1. 
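
The placement rule for arrays sharing a bank can be sketched as a running-offset computation. The names extern_array and assign_offsets are hypothetical, and the sketch assumes that arrays are packed back to back in words without padding. The manual does not state the actual alignment, so the offsets recorded by the compiler may differ.

```c
#include <stddef.h>

/* Hypothetical record for one array named in a "#pragma extern" directive. */
struct extern_array {
    const char *name;   /* array name, e.g. "x" */
    int bank;           /* external RAM bank, 1..4 */
    size_t words;       /* array length in words */
    size_t offset;      /* filled in: word offset inside the bank */
};

/* Assign offsets in order of appearance. Assumption: arrays in the same
 * bank are packed back to back with no padding. bank_words[b-1] is the
 * declared size of bank b (0 if the bank does not exist). Returns 0 on
 * success, -1 if an array names an undeclared bank or does not fit. */
int assign_offsets(struct extern_array *arr, size_t n,
                   const size_t bank_words[4])
{
    size_t next[4] = { 0, 0, 0, 0 };   /* next free word per bank */
    for (size_t i = 0; i < n; i++) {
        int b = arr[i].bank;
        if (b < 1 || b > 4 || bank_words[b - 1] == 0)
            return -1;
        if (arr[i].words > bank_words[b - 1] - next[b - 1])
            return -1;
        arr[i].offset = next[b - 1];
        next[b - 1] += arr[i].words;
    }
    return 0;
}
```

With arrays x (256 words, bank 1), y (128 words, bank 2) and z (64 words, bank 1) declared in this order, x and y start at offset 0 of their banks and z follows x at offset 256.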



4 Directories and Files 

After correct installation, the XPPC_ROOT environment variable is defined, and the PATH variable 
extended. $XPPC_ROOT is the XPP-VC root directory. $XPPC_ROOT/bin contains all binary files 
and the scripts xppvcmake and xppgcc. $XPPC_ROOT/doc contains this manual and the file 
xppvc_releasenotes.txt. XPP.h is located in the include subdirectory. 

Finally, $XPPC_ROOT/lib contains the options file xppvc_options. If an options file with the same name 
exists in the current working directory or in the .xds subdirectory of the user's home directory, it is used 
(searched in this order) instead of the master file in $XPPC_ROOT/lib. 



Option            Explanation                                           Default value in xppvc_options 

debug             debug output enabled                                  on 
version           XPP IP version                                        V2 
pacsize           number of ALU-PAEs in x and y direction               6/12 
xppsize           number of PACs in x and y direction                   1/1 
busnumber         number of data and event buses per row (both dir's)   6/6 
iramsize          number of words in one internal RAM                   256 
bitwidth          XPP data bit width                                    32 
freg_data_port    number of FREG data ports                             3 
breg_data_port    number of BREG data ports                             3 
freg_event_port   number of FREG event ports                            4 
breg_event_port   number of BREG event ports                            4 

Table 1: Options 

xppvc_options sets the compiler options listed in Table 1. Most of them define the XPP IP parameters 
which are used in the generated NML file. Lines starting with a # character are comment lines. 

Additionally, extram followed by four integers declares the external RAM banks used for storing ar- 
rays. At most four external RAMs can be used. Each integer represents the size of the bank declared. 
Size zero must be used for banks which do not exist. The master file contains the following line which 
declares four 4GB (1 G words) external banks: 

extram 1073741824 1073741824 1073741824 1073741824 

Note that, in order to simplify programming, xppvc_options does not have to be changed if an I/O 
unit is used for port accesses. However, the corresponding memory bank is not available in this case 
despite being declared. 
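
As an illustration of the extram line and of the "4GB (1 G words)" arithmetic, the following sketch parses such a line and converts a bank size from words to bytes. parse_extram and bank_bytes are hypothetical helpers, not part of XPP-VC, and the sketch assumes that the sizes are decimal word counts separated by white space.

```c
#include <stdio.h>
#include <string.h>

/* Parse one line of an options file. Returns 1 if the line is an
 * "extram" declaration with four sizes (stored in words[0..3]),
 * 0 for comment lines or other options, -1 for a malformed extram line.
 * Assumption: sizes are decimal word counts separated by white space. */
int parse_extram(const char *line, unsigned long words[4])
{
    char key[16];
    if (line[0] == '#')
        return 0;                       /* comment line */
    if (sscanf(line, "%15s", key) != 1 || strcmp(key, "extram") != 0)
        return 0;                       /* empty line or some other option */
    if (sscanf(line, "%*s %lu %lu %lu %lu",
               &words[0], &words[1], &words[2], &words[3]) != 4)
        return -1;                      /* fewer than four sizes */
    return 1;
}

/* Size of a bank in bytes for a given data bit width (32 by default). */
unsigned long long bank_bytes(unsigned long words, unsigned bitwidth)
{
    return (unsigned long long)words * (bitwidth / 8);
}
```

For the master file line, each bank holds 1073741824 words. At the default bit width of 32 this is 1073741824 * 4 = 4294967296 bytes, i.e. the 4GB stated above.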



5 Using XPP-VC 



5.1 xppvcmake 

In order to create an NML file, file.c is compiled with the command xppvcmake file.nml. 
xppvcmake file.xbin additionally calls xmap. With xppvcmake, XPP.h is automatically searched 
for in directory $XPPC_ROOT/include. 

The following output produced by translating the example program streamfir.c in Section 6.1 shows the 
programs called by xppvcmake: 

$ xppvcmake streamfir.nml 
pscc -I/home/wema/xppc/include -parallel -no PORKY_FORWARD_PROP4 
     -.spr streamfir.c 
porky -dead-code streamfir.spr streamfir.spr2 
partition streamfir.spr2 streamfir.svo 
Program analysis: 
  main: DO-LOOP, line 9 can be synthesized 
  main: can be synthesized completely 
Program partitioning: 
  Entire program selected for XPU module synthesis. 
  main: DO-LOOP, line 9 selected for synthesis 
porky -const-prop -scalarize -copy-prop -dead-code streamfir.svo 
      streamfir.svo1 
predep -normalize streamfir.svo1 streamfir.svo2 
porky -ivar -know-bounds -fold streamfir.svo2 streamfir.sur 
nmlgen streamfir.sur streamfir.xco 

pscc is the SUIF frontend which translates streamfir.c into the SUIF intermediate representation, and 
porky performs some standard optimizations. Next, partition analyzes the program. The output 
indicates that the entire program can and will be mapped to NML. Then, porky and predep perform 
some additional optimizations before nmlgen actually generates the file streamfir.nml. The SUIF file 
streamfir.xco is generated to inspect and debug the result of code transformations (see footnote 3). In the 
generated NML file, only the I/O ports are placed. All other objects are placed automatically by xmap. 
Cf. Section 6.1 for an example of the xsim program using the I/O ports corresponding to the stream 
functions used in the program. 

For an input file file.c, nmlgen also creates an interface description file file.itf in the working directory. It 
shows the array to RAM mapping chosen by the compiler. In the debug subdirectory (which is created), 
files file.part.dbg and file.nmlgen.dbg are generated. They contain more detailed debugging information 
created by partition and nmlgen respectively. The files file.first.dot and file.final.dot created in the 
debug directory can be viewed with the dotty graph layout tool. They contain graphical representations 
of the original and of the transformed and optimized version of the generated control/dataflow graph. 

5.2 xppgcc 

This command is provided for comparing simulation results obtained with xppvcmake, xmap and 
xsim (or from execution on actual XPP hardware) with a "direct" compilation of the C program 
with gcc on the host. xppgcc compiles the input program with gcc and binds it with predefined 
XPP_getstream and XPP_putstream functions. They read or write files port<n>_<m>.dat in the 
current directory for n in 1..4 and m in 0..1. For instance, the program in Section 6.1 is compiled as 
follows: 

xppgcc -o streamfir streamfir.c 

The resulting program streamfir will read input data from port1_0.dat and write its results to port4_0.dat 
(see footnote 4). 

Footnote 3: In an extended codesign compiler, the .xco file would also be used to generate the host partition of the program. 

6 Examples 
6.1 Stream Access 

The following program streamfir.c is a small example showing the usage of the XPP_getstream and 
XPP_putstream functions. The infinite WHILE-loop implements a small FIR filter which reads input 
values from port 1.0 and writes output values to port 4.0. The variables xd, xdd and xddd are used to 
store delayed input values. The compiler automatically generates a shift-register-like configuration for 
these variables. Since no operator dependences exist in the loop, the loop iterations overlap automatically, 
leading to a pipelined FIR filter execution. 

 1  #include "XPP.h" 
 2 
 3  main() { 
 4    int x, xd, xdd, xddd; 
 5 
 6    x = 0; 
 7    xd = 0; 
 8    xdd = 0; 
 9    while (1) { 
10      xddd = xdd; 
11      xdd = xd; 
12      xd = x; 
13      XPP_getstream(1, 0, &x); 
14      XPP_putstream(4, 0, (2*x + 6*xd + 6*xdd + 2*xddd) >> 4); 
15    } 
16  } 
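
For checking results by hand, the loop body of streamfir.c can be restated as a plain C function operating on buffers instead of ports. This is a sketch only: fir_state, fir_step and fir_run are hypothetical names that do not appear in XPP-VC, and the port accesses are replaced by array reads and writes.

```c
/* Current input and the three delayed inputs (x, xd, xdd, xddd in
 * streamfir.c). */
struct fir_state {
    int x, xd, xdd, xddd;
};

/* One loop iteration of streamfir.c: shift the delay line, take the new
 * sample, and return the filtered output. */
int fir_step(struct fir_state *s, int in)
{
    s->xddd = s->xdd;
    s->xdd = s->xd;
    s->xd = s->x;
    s->x = in;
    return (2 * s->x + 6 * s->xd + 6 * s->xdd + 2 * s->xddd) >> 4;
}

/* Filter a whole buffer, starting from an all-zero delay line. */
void fir_run(const int *in, int *out, int n)
{
    struct fir_state s = { 0, 0, 0, 0 };
    for (int i = 0; i < n; i++)
        out[i] = fir_step(&s, in[i]);
}
```

The coefficients 2, 6, 6, 2 sum to 16, so the shift by 4 gives the filter unit gain: a constant input of 16 produces the outputs 2, 8, 14, 16, 16, and so on. Note that in C the result of right-shifting a negative value is implementation-defined.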

After generating streamfir.xbin with the command xppvcmake streamfir.xbin, the following 
command reads the input file port1_0.dat and writes the simulation results to xpp_port4_0.dat. 

xsim -run 2000 -in1_0 port1_0.dat -out4_0 xpp_port4_0.dat 
streamfir.xbin > /dev/null 

xpp_port4_0.dat can now be compared with port4_0.dat generated by compiling the program with 
xppgcc and running it with the same port1_0.dat. 

Footnote 4: XPP-VC programs reading input data from or writing result data to external RAMs in xsim cannot be compared to 
programs using xppgcc. The results may also differ if a bitwidth other than 32 is used for the generated 
NML files. 

6.2 Array Access 

The following program arrayfir.c is an FIR filter operating on arrays. The first FOR-loop reads input data 
from port 1.0 into array x, the second loop filters x and writes the filtered data into array y, and the third 
loop outputs y on port 4.0. 

 1  #include "XPP.h" 
 2  #define N 256 
 3  int x[N], y[N]; 
 4  const int c[4] = { 2, 4, 4, 2 }; 
 5  main() { 
 6    int i, j, tmp; 
 7    for (i = 0; i < N; i++) { 
 8      XPP_getstream(1, 0, &tmp); 
 9      x[i] = tmp; 
10    } 
11    for (i = 0; i < N-3; i++) { 
12      tmp = 0; 
13      XPP_unroll(); 
14      for (j = 0; j < 4; j++) { 
15        tmp += c[j]*x[i+3-j]; 
16      } 
17      y[i+2] = tmp; 
18    } 
19    for (i = 0; i < N-3; i++) 
20      XPP_putstream(4, 0, y[i+2]); 
21  } 



xppvcmake produces the following output: 

$ xppvcmake arrayfir.nml 
pscc -I/home/wema/xppc/include -parallel -no PORKY_FORWARD_PROP4 
     -.spr arrayfir.c 
porky -dead-code arrayfir.spr arrayfir.spr2 
partition arrayfir.spr2 arrayfir.svo 
Program analysis: 
  main: FOR-LOOP i, line 7 can be synthesized/vectorized 
  main: FOR-LOOP j, line 14 can be synthesized/unrolled/vectorised 
  main: FOR-LOOP i, line 11 can be synthesized/vectorized 
  main: FOR-LOOP i, line 19 can be synthesized/vectorized 
  main: can be synthesized completely 
Program partitioning: 
  Entire program selected for NML module synthesis. 
  main: FOR-LOOP i, line 7 selected for pipeline synthesis 
  main: FOR-LOOP i, line 11 selected for pipeline synthesis 
  main: FOR-LOOP i, line 19 selected for pipeline synthesis 
  ... unrolling loop j 
porky -const-prop -scalarize -copy-prop -dead-code arrayfir.svo 
      arrayfir.svo1 
predep -normalize arrayfir.svo1 arrayfir.svo2 
porky -ivar -know-bounds -fold arrayfir.svo2 arrayfir.sur 
nmlgen arrayfir.sur arrayfir.xco 

The messages from partition show that all loops can be vectorized. The dependence analysis did not 
find any loop-carried dependences preventing vectorization. The inner loop in the middle of the program 
is unrolled. The outer loop's body is effectively substituted by the following statement: 

y[i+2] = c[0]*x[i+3] + c[1]*x[i+2] + c[2]*x[i+1] + c[3]*x[i]; 

Since all remaining loops are innermost loops, they are selected for pipeline synthesis. Array reads, 
computations, and array writes overlap. To reduce the number of array accesses, the compiler automatically 
removes redundant array reads. In the middle loop, only x[i+3] is read. For x[i+2], x[i+1] 
and x[i], delayed versions of x[i+3] are used, forming a shift-register. Therefore, each loop iteration 
needs only one cycle since one read from x, all computations, and one write to y can be executed 
concurrently. 
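
The effect of unrolling and of removing redundant array reads can be mimicked in sequential C. The sketch below compares the middle loop as written with a version that reads x only once per iteration and keeps the other three operands in delay variables. The function names and the prologue that pre-loads the delay variables are assumptions made for illustration; the manual does not describe how the generated configuration fills its shift-register.

```c
#define N 256

static const int c[4] = { 2, 4, 4, 2 };

/* Middle loop of arrayfir.c as written: four reads of x per iteration. */
void fir_direct(const int x[N], int y[N])
{
    for (int i = 0; i < N - 3; i++) {
        int tmp = 0;
        for (int j = 0; j < 4; j++)
            tmp += c[j] * x[i + 3 - j];
        y[i + 2] = tmp;
    }
}

/* Same loop after unrolling j and replacing x[i+2], x[i+1], x[i] by
 * delayed copies of x[i+3]: one read of x per iteration. */
void fir_shiftreg(const int x[N], int y[N])
{
    int d1 = x[2], d2 = x[1], d3 = x[0];   /* assumed prologue */
    for (int i = 0; i < N - 3; i++) {
        int in = x[i + 3];                  /* the only array read */
        y[i + 2] = c[0] * in + c[1] * d1 + c[2] * d2 + c[3] * d3;
        d3 = d2;                            /* shift the delay line */
        d2 = d1;
        d1 = in;
    }
}
```

Both functions write the same values to y[2] through y[N-2]. On the XPP, the three assignments that shift the delay line correspond to registers in the pipeline rather than to separate sequential steps.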

Finally, the following example program fragment is a 2-D edge detection algorithm. 

/* 3x3 horiz. + vert. edge detection in both directions */ 
for (v=0; v<=VERLEN-3; v++) { 
  for (h=0; h<=HORLEN-3; h++) { 
    htmp = (p1[v+2][h] - p1[v][h]) + 
           (p1[v+2][h+2] - p1[v][h+2]) + 
           2 * (p1[v+2][h+1] - p1[v][h+1]); 
    if (htmp < 0) 
      htmp = - htmp; 
    vtmp = (p1[v][h+2] - p1[v][h]) + 
           (p1[v+2][h+2] - p1[v+2][h]) + 
           2 * (p1[v+1][h+2] - p1[v+1][h]); 
    if (vtmp < 0) 
      vtmp = - vtmp; 
    sum = htmp + vtmp; 
    if (sum > 255) 
      sum = 255; 
    p2[v+1][h+1] = sum; 
  } 
} 



As the output of partition shows, both loops can be vectorized. Since only innermost loops can be 
pipelined, the outer loop is executed sequentially. (Note that the line numbers in the program outputs are 
not obvious since only a program fragment is shown above.) 
partition edge.spr2 edge.svo 
Program analysis: 
  main: FOR-LOOP h, line 22 can be synthesized/can be vectorized 
  main: FOR-LOOP v, line 21 can be synthesized/can be vectorised 
  main: can be synthesized completely 
Program partitioning: 
  Entire program selected for XPP module synthesis. 
  main: FOR-LOOP h, line 22 selected for pipeline synthesis 
  main: FOR-LOOP v, line 21 selected for synthesis 

Also note the following additional features of this program: Address generators for the 2-D array accesses 
are automatically generated, and the array accesses are reduced by generating shift-registers for each of 
the three image lines accessed. Furthermore, the conditional statements are implemented using SWAP 
(MUX) operators. Thus the streaming of the pipeline is not affected by which branch the conditional 
statements take. 
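
The idea of implementing conditionals with multiplexers can be shown in C by rewriting the loop body so that both alternatives are computed and a select picks one. This is a sketch under assumptions: mux and edge_pixel are hypothetical names, the 3x3 window w stands for p1[v..v+2][h..h+2], and the C select expression only models the behaviour of a MUX operator, not its NML form.

```c
/* Two-input multiplexer: returns a if sel is nonzero, else b. */
static int mux(int sel, int a, int b)
{
    return sel ? a : b;
}

/* Body of the edge detection loop for one pixel, with the three
 * if-statements replaced by selects. w[r][c] stands for p1[v+r][h+c]. */
int edge_pixel(int w[3][3])
{
    int htmp = (w[2][0] - w[0][0]) + (w[2][2] - w[0][2])
             + 2 * (w[2][1] - w[0][1]);
    int vtmp = (w[0][2] - w[0][0]) + (w[2][2] - w[2][0])
             + 2 * (w[1][2] - w[1][0]);
    int sum;

    htmp = mux(htmp < 0, -htmp, htmp);   /* was: if (htmp < 0) htmp = -htmp; */
    vtmp = mux(vtmp < 0, -vtmp, vtmp);   /* was: if (vtmp < 0) vtmp = -vtmp; */
    sum = htmp + vtmp;
    return mux(sum > 255, 255, sum);     /* was: if (sum > 255) sum = 255;   */
}
```

Because both inputs of every select are always computed, each pixel takes the same path through the operator network, which is why the pipeline keeps streaming regardless of the data.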



7 Future Compiler Extensions 

Apart from removing some of the restrictions of Section 3.1.2, the following extensions are planned for 
XPP-VC. 

7.1 Temporal Partitioning 

By using the pragma function XPP_next_conf(), programs are partitioned into several configurations 
which are loaded and executed sequentially on the XPP Core. Specific NML configuration commands are 
generated which also exploit XPP's sophisticated configuration and preloading capabilities. Eventually, 
the temporal partitions will be determined automatically. 

7.2 Program Transformations 

For more efficient XPP configuration generation, some program transformations are useful. In addition 
to loop unrolling, loop merging, loop distribution and loop tiling will be used to improve loop handling, 
i.e., to enable more parallelism or better XPP usage. 

Furthermore, programs containing more than one function could be handled by inlining function calls. 

7.3 Codesign Compiler 

This section sketches what an extended C compiler for an architecture consisting of an XPP Core 
combined with a host processor might look like. The compiler should map suitable program parts, especially 
inner loops, to the XPP Core, and the rest of the program to the host processor. I.e., it is a host/XPP 
codesign compiler, and the XPP Core acts as a coprocessor to the host processor. 

This compiler's input language is full standard ANSI C. The user uses pragmas to annotate those 
program parts that should be executed by the XPP Core (manual partitioning). The compiler checks if the 
selected parts can be implemented on the XPP. Program parts containing non-mappable operations must 
be executed by the host. 

The program parts running on the host processor ("SW"), and the parts running on the PAE 
array ("XPP") cooperate using predefined routines (copy_data_to_XPP, copy_data_to_host, start_config(n), 
wait_for_coprocessor_finish(n), request_config(n)). For all XPP program parts, XPP configurations are 
generated. In the program code, the XPP part n is replaced by request_config(n), start_config(n), 
wait_for_coprocessor_finish(n), and the necessary data movements. Since the SUIF compiler contains 
a C backend, the altered program (host parts with coprocessor calls) can simply be written back to a C 
file and then processed by the native C compiler of the host processor. 

Thus the sequential control flow of the C program defines when XPP parts are configured into the XPP 
Core and executed. 
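
To make the call sequence concrete, the following sketch shows what a host program with one replaced XPP part might look like. Only the routine names are taken from the text above. Their signatures, the order of the data movements relative to the other calls, the stub bodies, the trace buffer and the stand-in computation (doubling each word) are assumptions made so that the example is self-contained.

```c
#include <stdio.h>
#include <string.h>

static char trace[256];        /* records the order of coprocessor calls */
static int xpp_mem[64];        /* stand-in for memory on the XPP side */
static int xpp_words;

static void log_call(const char *name, int n)
{
    char buf[48];
    snprintf(buf, sizeof buf, "%s(%d);", name, n);
    strncat(trace, buf, sizeof trace - strlen(trace) - 1);
}

/* Stub versions of the predefined routines (signatures are assumed). */
void request_config(int n) { log_call("request_config", n); }

void copy_data_to_XPP(const int *src, int words)
{
    memcpy(xpp_mem, src, (size_t)words * sizeof(int));
    xpp_words = words;
    log_call("copy_data_to_XPP", words);
}

void start_config(int n)
{
    for (int i = 0; i < xpp_words; i++)   /* stand-in for XPP part n */
        xpp_mem[i] *= 2;
    log_call("start_config", n);
}

void wait_for_coprocessor_finish(int n) { log_call("wait_for_coprocessor_finish", n); }

void copy_data_to_host(int *dst, int words)
{
    memcpy(dst, xpp_mem, (size_t)words * sizeof(int));
    log_call("copy_data_to_host", words);
}

/* Host code after partitioning: XPP part 1 has been replaced by calls. */
void host_program(int *data, int words)
{
    request_config(1);
    copy_data_to_XPP(data, words);
    start_config(1);
    wait_for_coprocessor_finish(1);
    copy_data_to_host(data, words);
}
```

The sequential order of the calls in host_program is what determines when the configuration is loaded and run, in line with the statement that the control flow of the C program defines when XPP parts are configured and executed.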
Abstract 

The eXtreme Processing Platform (XPP) technology offers 
a unique reconfigurable computing platform supported 
with a set of tools. A C compiler, which integrates both 
new and efficient compilation techniques and temporal 
partitioning, is presented. Temporal partitioning guarantees 
the compilation of programs with unlimited complexity as 
long as the supported C-subset is used. A new partitioning 
scheme, which permits mapping large loops of any kind and is 
constrained neither by loop dependences nor by nested 
structures, is also presented. Furthermore, temporal partitioning 
is applied to reduce the configuration time overhead and 
thus can lead to performance gains. The compilation from 
C code to the configuration data, ready to be downloaded 
onto the XPP, takes seconds for complex examples, which 
is, as far as we know, not reproduced by any other 
reconfigurable computing technology. The compiler represents 
a step forward, by furnishing a truly "push-button" 
approach, only comparable to microprocessor domains, and 
thus can spread the use of the XPP technology and deal 
with time-to-market pressures positively. 



1 Introduction 

Many of today's applications are characterized by intensive 
data-stream processing and high-performance requirements. 
It is more and more evident that such performance cannot 
be achieved with today's microprocessor technology. 
Conventional processors (including DSPs) are geared for 
sequential processing. Multi-DSP and very large instruction 
word (VLIW) processors still have severe memory bottlenecks, 
lack the number of data ports required to support 
multi-channel, high speed data streams, and fail to furnish 
low power consumption solutions. Accelerating specific 
functions using application-specific integrated circuits 
(ASICs) relieves some of the processing burden and adds some 
required features, but limits flexibility and requires expensive 
non-recurring engineering (NRE) costs and long design 
cycles. High density field-programmable gate arrays (FPGAs) 
eliminate the NRE costs and add flexibility, but still require 
long timing optimizations and verification cycles and 
low level hardware efforts. Additionally, the fine-grained 
structure adopted in FPGAs is not suitable for mapping at the 
algorithmic level, which is shown by the well-known difficulties 
of providing a "push-button" high-level methodology to 
program these architectures. 

New reconfigurable processing units (RPUs) are being 
introduced trying to solve those problems [1]. One of the 
new promising architectures is the XPP [2][3]. The XPP 
is a coarse-grained, runtime-reconfigurable, 2-D array 
parallel structure. The architecture was designed to facilitate 
programming and to support pipelining, dataflow computations, 
and parallelism from the instruction to the task level 
efficiently. Therefore, this technology is well suited for 
applications in multimedia, telecommunications, simulation, 
digital signal processing, and similar stream-based application 
domains. The XPP architecture also supports dynamic 
self-reconfiguration in a user transparent way. In order to 
drastically reduce the time to program the XPP, and to shield 
the user from architecture details, a high-level compiler 
integrating temporal partitioning is required. Such a compiler 
is the main topic of this paper. 

This paper is organized as follows. The next section 
introduces briefly the XPP technology. Section 3 outlines 
compilation to the XPP and section 4 describes the tempo- 
ral partitioning steps. Section 5 shows some experimental 
results, section 6 points out the main differences between 
this and previous works, and finally section 7 concludes the 
paper and enumerates ongoing and future work planned. 



2 XPP Technology 

The XPP technology consists of a reconfigurable computing 
platform delivered as a device or an intellectual property 
(IP) core, and a complete development tool suite (XDS) 
[2]. An XPP can be used as a coprocessor for CPU and DSP 
architectures. A prior version of the technology has resulted 
in the XPU128-ES [3], a prototype device, which was 
produced in silicon. 

The XPP architecture is based on a hierarchical array 
of coarse-grain, adaptive computing elements called 
Processing Array Elements (PAEs), and a packet-oriented 
communication network. The strength of the XPP technology 
originates from the combination of array processing with 
unique and powerful run-time reconfiguration mechanisms. 
Different tasks or applications can be configured and run 
independently on different parts of the array. Reconfiguration 
is triggered externally or even by special event signals 
originating within the array, enabling self-reconfiguring 
designs. By utilizing protocols implemented in hardware, data 
and event packets are used to process, generate, decompose 
and merge streams of data. 

2.1 Array Structure 

An XPP contains one or several Processing Array Clusters 
(PACs), i.e., rectangular blocks of PAEs. Fig. 1 shows 
the structure of a typical XPP device. It contains four PACs 
(see top left-hand side). Each PAC is attached to a 
Configuration Manager (CM) responsible for writing configuration 
data into the configurable objects of the PAC using a 
dedicated bus. Multi-PAC XPPs contain additional CMs for 
configuration data handling, forming a hierarchical tree of 
CMs. The root CM is called the supervising CM (SCM). 
It has an external interface (dotted arrow originating from 
the SCM in Fig. 1) which usually connects the SCM to an 
external configuration memory. A CM consists of a state 
machine and internal RAM for configuration caching (see 
top right-hand side of Fig. 1). 

Horizontal buses carry data and events. They can be 
segmented by configurable switch-objects, and connected 
to PAEs and special I/O objects at the periphery of the 
device. The I/O objects can be used for data-streaming or 
to access external resources (e.g., memories). A column 
of ports to the corresponding leaf CM is located on the 
array. A CMPort can be used to send events to the CM 
from the array. The typical PAE shown in Fig. 1 (bottom 
center) contains three objects: one FREG (forward register), 
one BREG (backward register) and one ALU. The 
FREG object is used for vertical forward routing (with a 
programmable number of register stages), or to perform 
MERGE, SWAP or DEMUX operations (for controlled 
stream manipulations). The BREG object is used for vertical 
backward routing (registered or not), or to perform some 
selected arithmetic operations (e.g., ADD, SUB, SHIFT). 
The BREGs can also be used to perform logical operations 
on events. Each ALU (see its internal structure on the 
bottom left-hand side of Fig. 1) performs common two-input 
fixed-point arithmetical and logical operations, and 
comparisons. A MAC (multiply and accumulate) operation can 
be performed using the ALU and the BREG objects of one 
PAE in a single clock cycle. 

Another standard PAE object is the memory object 
which can be used in FIFO mode or as RAM for lookup 
tables, intermediate results, etc. If such objects are needed, 
they are located in the left and/or right columns of PAEs of 
each PAC. However, any PAE object functionality can be 
included in the XPP architecture. 

A set of parameterizable features can be used to furnish 
an XPP that best fits user and application demands. 
Those features include: the number of PACs and their PAEs, 
number of internal memories, number of I/O ports, number 
of buses, word bitwidth, cache size, depth of the FIFO to 
configure each object, etc. 

Figure 1: XPP architecture. 



2.2 Packet Handling and Synchronization 

PAE objects as defined above communicate via a packet- 
oriented network. Two types of packets are sent through the 
array: data and event packets. Data packets have a uniform 
bitwidth specific to the XPP Core or device. 

In normal operation mode, PAE objects are self- 
synchronizing. An operation is performed as soon as all 
necessary data input packets are available. The results are 
forwarded as soon as they are computed and the previous 
results have been consumed. Thus, a signal-flow graph can 
be mapped directly to the ALU objects and data-streams can 
flow through them in a pipelined manner without adding 
specific hardware.

Event packets are one bit wide. They transmit state
information which controls ALU execution and packet
generation. For instance, they can be used to control the
merging of data-streams or to deliberately discard data packets.
Thus, conditional computations depending on the results of
earlier ALU operations are feasible. Events can even trigger
a self-reconfiguration of the device as explained below.

Each data or event packet is only forwarded if the
previous one has already been consumed. The communication
system was designed to transmit one packet on each
interconnect per cycle. Hardware protocols ensure that no
packets are lost, even in the case of pipeline stalls or during the
configuration process. This simplifies application
development considerably. No explicit scheduling of operations
is required.
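
The firing rule described in this section can be illustrated with a small software model. The following C sketch is not PACT code: the Channel type and the functions channel_send and adder_step are hypothetical names introduced only for illustration, and a one-packet-deep channel is an assumption. It shows an object that fires only when all input packets are available and the previous result has been consumed, so that a stall never loses a packet.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a one-packet-deep connection between two objects. */
typedef struct {
    bool full;  /* true while a packet waits to be consumed */
    int  data;
} Channel;

/* Try to place a packet on a channel. The packet is refused (not lost)
   while the previous packet has not been consumed. */
bool channel_send(Channel *c, int value) {
    assert(c != 0);
    if (c->full) return false;
    c->data = value;
    c->full = true;
    return true;
}

/* One cycle of a two-input adder object. It fires only if both inputs
   hold a packet and the output channel is free. Returns true if it fired. */
bool adder_step(Channel *a, Channel *b, Channel *out) {
    assert(a != 0 && b != 0 && out != 0);
    if (!a->full || !b->full || out->full) return false; /* stall */
    out->data = a->data + b->data;
    out->full = true;
    a->full = false;  /* both input packets are consumed */
    b->full = false;
    return true;
}
```

Calling adder_step once per simulated cycle for every object gives the self-synchronizing behavior: no central scheduler decides when an operator runs.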

2.3 Configuration

The XPP architecture is optimized for rapid and
user-transparent configuration. For this purpose, the
configuration managers in the CM tree operate independently
(without global synchronization), and therefore are able to
configure their respective parts of the array in parallel.
Every PAE locally stores its current configuration state, i.e.,
whether it is part of a configuration or not (states "configured"
or "free"). Once a PAE is configured, it changes its state
to "configured". This prevents the respective CM from
reconfiguring a PAE which is still in use. The CM caches
the configuration data in its internal RAM and constantly
tries to configure the objects used by the next configuration
requested. Each XPP object has a configuration FIFO
which stores data of subsequent configurations. Once an
object has been released (state "free"), the next
configuration word in its FIFO is loaded immediately. Hence it is
possible to reconfigure partially in one clock cycle.
Additionally, a prefetching mechanism is used. While a
configuration is being loaded onto the FIFO of each object, other
configurations may already be requested and cached in the
low-level CM's internal RAM. Thus, a configuration does not
need to be requested all the way from the SCM down to the
array when objects become available. While a configuration is
being loaded, its PAEs start their part of the computations as
soon as they are in state "configured".
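
The interplay of the per-object state and the configuration FIFO can be sketched as follows. This is only an illustrative C model under stated assumptions: the Pae structure, the function names, and the FIFO depth of 4 are hypothetical (the text only says that the FIFO depth is a parameter of the XPP), and a configuration is reduced to a single integer word.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FIFO_DEPTH 4  /* assumed depth; the real depth is an XPP parameter */

typedef enum { PAE_FREE, PAE_CONFIGURED } PaeState;

/* Hypothetical model of one array object with its configuration FIFO. */
typedef struct {
    PaeState state;
    int current;           /* configuration word in use, valid if configured */
    int fifo[FIFO_DEPTH];  /* words of subsequent configurations */
    int head, count;
} Pae;

void pae_init(Pae *p) {
    assert(p != NULL);
    p->state = PAE_FREE;
    p->current = 0;
    p->head = 0;
    p->count = 0;
}

/* A free object takes the next queued word immediately. */
void pae_load_next(Pae *p) {
    if (p->state == PAE_FREE && p->count > 0) {
        p->current = p->fifo[p->head];
        p->head = (p->head + 1) % FIFO_DEPTH;
        p->count--;
        p->state = PAE_CONFIGURED;
    }
}

/* The CM queues a word; an object still in use is never overwritten. */
bool pae_push(Pae *p, int word) {
    assert(p != NULL);
    if (p->count == FIFO_DEPTH) return false;
    p->fifo[(p->head + p->count) % FIFO_DEPTH] = word;
    p->count++;
    pae_load_next(p);
    return true;
}

/* Releasing an object frees it and loads the next queued word, if any. */
void pae_release(Pae *p) {
    assert(p != NULL);
    p->state = PAE_FREE;
    pae_load_next(p);
}
```

In this model a released object with a non-empty FIFO becomes "configured" again within the same call, which mirrors the reconfiguration of an object in one clock cycle described above.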

Each ALU object has an input event port that triggers
the self-releasing of its resources and of all of the objects
connected to it. Such an event is successively broadcast
according to the interconnections.

Because of its coarse-grain nature, an XPP device can
be configured rapidly. Since only the configuration of those
array objects actually used is necessary, the configuration
time depends on the application.

2.4 Development Tools 

The XPP can be programmed by using the Native Mapping
Language (NML) [2], a PACT proprietary structural
language with reconfiguration primitives. It gives the
programmer direct access to all hardware features. In NML,
configurations consist of modules which are specified as
in a structural hardware description language, similar to,
for instance, structural VHDL. PAE objects are explicitly
allocated, optionally placed, and their connections
specified. Additionally, NML includes statements to support
configuration handling. Thus, configuration handling is
an explicit part of the NML application program. XDS
is an integrated environment for programming with NML.
The main component is the mapper xmap, which compiles
NML source files, places and routes the objects, and
generates XPP binary files. xmap uses an enhanced force-based
placer with short runtimes. The XPP binaries can either be
simulated and visualized cycle by cycle with the xsim and
xvis tools, or directly executed on an XPP device. A
high-level compiler, described in the next section, has been added
to XDS and permits mapping C programs onto the XPP.

2.5 Application Execution on XPP

Reconfiguration and prefetching requests can be issued 
by any CM in the tree (including the SCM which can re- 
spond to external requests) and also by event signals gen- 
erated in the array itself. Running modules can do a self- 
releasing of their resources and request another config- 
uration. Thus, it is possible to execute an application con- 
sisting of several configurations without any external con- 
trol. 

The CM of the XPP permits the exploitation of speculative
configuration (footnote 1), i.e., the configuration of a module
possibly used after the current one has finished execution. If the path
which includes that module is taken, the CM only has to
trigger the execution of the configuration (see the section of
the NML code in Fig. 2 and the simulation performed with
xsim in Fig. 3, where conf_MOD2 is speculatively configured
during the execution of conf_MOD0). If this path is
not taken, the CM triggers the releasing of the resources
already configured and requests the other configuration.



3 Compiling C Code with XPP-VC 

The XPP Vectorizing C Compiler XPP-VC is based on
the SUIF compiler framework [4]. SUIF is used because
it is easily extensible. The XPP-VC compilation
flow is shown in Fig. 4. An options file, used by the
compiler, specifies the parameters of the targeted XPP and
the external memories connected to the XPP. To access XPP
I/O ports, specific C functions are provided.

1 This has similarities to speculative execution. In this case, before
knowing if a configuration will be requested, its configuration is started.

CONFIG conf_MOD0 {
  CONF_MODULE(MOD0)        // request the configuration of MOD0
  REQUEST(conf_MOD2_spec)  // start speculative configuration
  // if (MOD0.CMPort0 == "0") then conf_MOD2_exec is requested
  CONF_CMPORT(MOD0.CMPort0, conf_MOD2_exec)
  // if (MOD0.CMPort1 == "1") then conf_MOD1 is requested
  CONF_CMPORT(MOD0.CMPort1, conf_MOD1)
}

CONFIG conf_MOD2_spec {    // request the configuration of MOD2
  CONF_MODULE(MOD2)        // but do not start it
}

CONFIG conf_MOD2_exec {    // MOD2 is taken
  SET(MOD2.start.A = 1)    // enable the start of computing of MOD2
  REQUEST(conf_MOD3)       // request the next configuration
}

CONFIG conf_MOD1 {         // MOD1 is taken
  REQUEST(conf_MOD2_rec)   // releasing of resources of MOD2
  CONF_MODULE(MOD1)        // request the MOD1
  REQUEST(conf_MOD3)       // request the next configuration
}

CONFIG conf_MOD2_rec {
  RECONF(MOD2.start)       // release the resources of MOD2
}



Figure 2: Section of NML describing the control flow. 



The compiler starts with some architecture-independent
preprocessing passes based on well-known compilation
techniques [5]. During this step, FOR loops are automatically
unrolled if instructed by the programmer. Then the
compiler performs a data-dependence analysis. The compiler
tries to vectorize inner program FOR loops. In XPP-VC,
vectorization means that loop iterations are overlapped
and executed in a pipelined, parallel fashion. This technique
is based on the Pipeline Vectorization method developed for
reconfigurable architectures [6].

The C program can be manually split into several modules
by using annotations. Otherwise, automatic temporal
partitioning can be applied (see Section 4) in order to furnish
mappable modules and to reduce the overall latency.

Figure 3: Speculative configuration (enables earlier
activation of MOD2).

MODGen generates one NML module for each temporal
partition. First, program data is allocated on the XPP. By
default, MODGen maps each program array to internal or
external RAM while scalar variables are stored in registers
within the PAEs. Next, a control/dataflow graph (CDFG) is
generated. Straight-line code without array accesses can be
directly mapped to a data-flow graph since the data
dependences are obvious in the DAG representation. One ALU
is allocated for each operator in the CDFG. Because of the
self-synchronization of operators on the XPP, no explicit
control or scheduling is needed. The same is true for
conditional execution of such blocks. Both branches are executed
in parallel and MUX operators select the correct output (and
discard the other one) depending on the condition. This
data-driven execution of the operators automatically yields
instruction-level parallelism. In contrast, accesses to the
same array have to be controlled explicitly to maintain the
correct execution order. MERGE operators (which select
one input without discarding the other one) route address
and write data packets in the correct order to the RAM, and
DEMUX operators route read data packets to the correct
subsequent operator. State machines for generating the
correct sequence of event signals (to control these operators)
are synthesized by the compiler. For conditional branches
containing array accesses or inner loops, DEMUX operators
controlled by the IF condition route data packets only
to the selected branch, and output values are taken from the
branch activated. Thus, only selected branches receive data
packets and execute.
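
The MUX-based mapping of a conditional block without array accesses can be pictured in plain C. The sketch below is only an analogy, not compiler output: the function names mux and abs_diff_dataflow are hypothetical, and each local variable stands for the output of one ALU object. Both branches are always computed and the condition merely selects which result is passed on.

```c
/* MUX operator: both inputs are present, the condition selects one
   and the other value is discarded. */
int mux(int cond, int then_val, int else_val) {
    return cond ? then_val : else_val;
}

/* Data-flow view of:  if (a > b) r = a - b; else r = b - a;  */
int abs_diff_dataflow(int a, int b) {
    int cond = a > b;  /* comparator, produces the selecting event */
    int t = a - b;     /* "then" branch, always computed */
    int e = b - a;     /* "else" branch, always computed */
    return mux(cond, t, e);
}
```

A branch containing array accesses or inner loops could not be handled this way, because executing the untaken branch would have side effects; this is where the DEMUX-based routing described above is used instead.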

In loops, all variables updated in the loop body are
handled as follows. The first iteration uses an input packet for
the variable's value, and the subsequent iterations use
packets generated in the previous iteration. In all but the last
iteration, a DEMUX operator routes the outputs of the loop
body back to the body inputs. Only the results of the last
iteration are routed to the loop output by the DEMUX
operators. The control packets for the DEMUX are generated by
the loop counter or the comparator evaluating the exit
condition. Note that the internal operators' outputs cannot just
be connected to subsequent operators since they produce a
result in each loop iteration. The required last packet would
be hidden by a stream of intermediate packets. If array
accesses are present, a loop iteration may only be started after
the previous iteration has terminated because the original
access order must be maintained. This is enforced by event
signals.

For generating more efficient XPP configurations, MODGen
generates pipelined operator networks for inner program
loops which have been annotated for vectorization by
the preprocessing step. In other words, subsequent loop
iterations are started before previous iterations have
finished. Data packets flow continuously through the operator
pipelines. By applying pipeline balancing techniques,
maximum throughput is achieved. For many programs,
additional performance gains are achieved by the complete loop
unrolling transformation. Although unrolled loops usually
require more XPP resources, they yield more parallelism
and better exploitation of the XPP. To reduce the number of
array accesses, the compiler automatically removes
redundant array reads. When array references inside loops access
subsequent element positions, the compiler only uses one
reference and generates delay structures, forming shift
registers.
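
The effect of replacing several references to neighboring array elements by one reference plus a shift register can be shown on a three-element sliding sum. This C sketch is an illustration under assumptions, not the code generated by XPP-VC: the functions sum3_naive and sum3_shift and the read counter are hypothetical, and the delay registers d1 and d2 stand for the delay structure mentioned above.

```c
#include <stddef.h>

/* Naive form: three array reads per output element. */
void sum3_naive(const int *x, int *y, size_t n) {
    for (size_t i = 2; i < n; i++)
        y[i] = x[i] + x[i - 1] + x[i - 2];
}

/* Shift-register form: one array read per iteration. d1 and d2 hold
   the two previously read elements. Returns the number of array reads. */
size_t sum3_shift(const int *x, int *y, size_t n) {
    size_t reads = 0;
    int d1 = 0, d2 = 0;
    for (size_t i = 0; i < n; i++) {
        int cur = x[i];  /* the only reference to x */
        reads++;
        if (i >= 2)
            y[i] = cur + d1 + d2;
        d2 = d1;         /* shift the delay line */
        d1 = cur;
    }
    return reads;
}
```

For n input elements the naive loop issues 3(n-2) reads while the shift-register version issues n, which matters on the XPP because every additional reference to the same RAM needs extra MERGE and DEMUX objects.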

Finally, each module generated by MODGen is placed
and routed automatically by xmap.

The XPP-VC compiler currently supports a C subset
sufficient for programming real applications. struct data
types, pointer operations, irregular control flow (break,
continue, goto, label), and recursive and operating
system calls are not supported or cannot be mapped to the
XPP.



4 Temporal Partitioning 

A program too large to fit in an XPP can be handled by
splitting it into several parts (configurations) such that each
one is mappable. Temporal partitioning permits the
automatic exposing of configurations such that the overall
execution time of the application is minimized and each
configuration is successfully mapped onto the XPP resources.
It considers the costs to load each configuration into the cache,
to configure it, and to execute it on the XPP. An important
strategy that is considered is to pre-fetch configurations while
another is being configured or is running. Arrays of constants or
with pre-defined values used in one or more configurations can
be initialized in parallel with the execution of the previous
configurations. This takes advantage of the initialization of
the array carried out by using the configuration bus.

The set of partitions resulting from the splitting is then
processed by MODGen, generating a set of configurations.
Next, specific NML configuration commands are generated
which also exploit the XPP's sophisticated configuration and
pre-fetching capabilities, and specify the configuration
control flow that is orchestrated by the CM.

4.1 Benefits of Temporal Partitioning

Temporal partitioning targeting the XPP can reduce,
when efficiently applied, the overall execution time. Such a
reduction can mainly be achieved by the following means:
(1) reducing the complexity of each partition can reduce
the interconnection delays (long interconnections may pass
through registers and thus add clock cycle delays); (2)
reducing the number of references to the same resource in the
section of the program related to each partition, by
distributing the overall references among partitions, can
lead to performance gains as well. This happens with
statements in the program that refer to the same
array; (3) reducing the overall configuration overhead by
overlapping fetching, configuration and execution of
distinct partitions.

Example: Consider the C example max_avg shown
in Fig. 5. Configuration boundaries are represented by
XPP_next_conf() statements. They define four configurations
in the code (see Fig. 6). Apart from exposing temporal
partitions in such a way that the mapping to XPP is
accomplished, combining only the most frequently taken
conditional paths in the same partition can reduce the total
execution time by substantially reducing the reconfiguration
time (since the partitions for the other paths are not
configured when they are not taken). Fig. 6 presents such a case.
If the path bb_0 and bb_1 has been identified as the most
frequently executed, this path can be in the same partition
(footnote 2). In such a case, the configuration related to bb_2
will only be called when the most frequent path has not been taken.

2 Tail duplication of bb_3 would permit a configuration with
(bb_0, bb_1, bb_3) and another one with (bb_2, bb_3).

// max_avg example

if (op==1) { // average kernel
  sum = 0;

    sum += x[i];

  }

  average = sum/N;
} else { // max kernel
  max = 0;

    if (x[i] > max) max = x[i];

}



Figure 5: Example with two conditionally executed kernels
and with configuration boundaries represented.




Figure 6: CFG of the algorithm shown in Fig. 5. Lines
crossing edges represent the XPP_next_conf() statements in
the code. Bubbles containing basic blocks represent the
regions to be implemented in different partitions.

Since configuration takes many clock cycles, it is in most
cases preferable to reuse a configuration as long as
possible in order to reduce the reconfiguration time
overhead. Thus, loops in the source code are always good
candidates to be entirely implemented by a single configuration.

4.2 Partitioning Loops

Each loop that does not fit onto the XPP can be dealt with
by performing loop distribution [5] (if applicable) or by
partitioning the loop and using the CM to orchestrate the control
flow. Currently, loop distribution is not automatically
applied. Instead, we propose a new method to partition
complex loops without restrictions. All loops whose
bodies must be partitioned are transformed into straight-line
code with a jump to the loop exit or to the next iteration, so
that each partition can be compiled by MODGen. Fig. 7
shows an example of such a transformation without the
statements needed to communicate the value of scalar variables
between configurations. Each configuration requests the
next configuration to be taken (if none is requested, the
application terminates and the last configuration releases its
resources). Depending on the value of the i<N condition,
config. #2 takes one of two exits, which request #3 or #4
respectively. Since config. #3 always requests #2 at the end
of its execution, the initial behavior of the loop is preserved.
The temporal partitioning creates two additional
configuration boundaries to preserve the initial functionality.
Fig. 7b shows that configuration boundaries were
inserted before and after the if statement. These boundaries
are needed since the code before and after will be executed
once, whereas the if header and body will iterate N+1 and
N times respectively.



int i;                    int i;

                          i=0;                  #1

for(i=0;i<N;i++) {        lab1: if(i<N) {       #2

  stmt1;                    stmt1;              #2

  XPP_next_conf();          stmt2;              #3

  stmt2;                    i++;                #3

}                           goto lab1;          #3

stmt3;                    }                     #3

                          stmt3;                #4

a)                        b)                    c)



Figure 7: Example of the transformation applied for
partitioning loops: a) original code with the added statement
representing where the loop is partitioned; b) transformed
code; c) configuration ID for each statement in b).
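
The control flow between the four configurations of Fig. 7 can be checked with a small C model. This is a sketch under assumptions rather than generated code: the Trace structure and run_partitioned_loop are hypothetical, the switch statement plays the role of the CM, the counters stand in for stmt1 to stmt3, and the variable i is simply kept in a C variable (on the XPP its value would have to be communicated between configurations).

```c
#include <assert.h>

/* Counters standing in for stmt1, stmt2, stmt3, plus the number of
   configurations executed (hypothetical instrumentation). */
typedef struct {
    int stmt1, stmt2, stmt3, configs_run;
} Trace;

/* Executes the partitioned loop of Fig. 7b. Each case is one
   configuration; the value left in conf is the configuration it
   requests next (0 means nothing is requested, so the run ends). */
Trace run_partitioned_loop(int N) {
    Trace t = {0, 0, 0, 0};
    int i = 0;
    int conf = 1;
    while (conf != 0) {
        t.configs_run++;
        switch (conf) {
        case 1: i = 0; conf = 2; break;
        case 2:                      /* the if header */
            if (i < N) { t.stmt1++; conf = 3; }
            else       { conf = 4; }
            break;
        case 3: t.stmt2++; i++; conf = 2; break;  /* always requests #2 */
        case 4: t.stmt3++; conf = 0; break;
        default: assert(0); conf = 0; break;
        }
    }
    return t;
}
```

For N iterations, configuration #2 runs N+1 times and #3 runs N times, in line with the iteration counts stated above, giving 2N+3 executed configurations in total in this model.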


4.3 Automatic Partitioning

From the SUIF representation of the C source code the
temporal partitioning phase constructs an extended
Hierarchical Task Graph (footnote 3), the HTG+. This extended
graph has two types of nodes: (1) behavioral nodes representing
lines of code in the input program; (2) array nodes
representing each array existing in the source code. For instance,
Fig. 8 shows the top level of the HTG+ for an
implementation of the DCT (Discrete Cosine Transform) based on
matrix multiplications. Type (1) nodes have three distinct
sub-types: (a) block nodes representing basic blocks; (b)
compound nodes representing if-then-else structures;
(c) loop nodes representing the loops (for, while). Loop
and compound nodes explicitly embody hierarchical levels.
Edges in the HTG+ represent data communication between
two nodes or just enforce execution precedence.

Each behavioral node of the HTG+ is labeled with the
following information (some of the labelling steps require
estimation efforts): (1) block and compound nodes:
number of ALUs and REGs; (2) loop nodes: number of
iterations (if unbounded, profiling can be used), and number of
ALUs and REGs; (3) array nodes: the size of the array,
type of the elements, and, when they exist, the
initialization values. Each edge between two behavioral nodes of the
HTG+ is labeled with the number of data words that must
be transferred between the two nodes. Each edge between
an array and a behavioral node in the HTG+ is labeled with
the number of load and store references in the source code
represented by the behavioral node to that particular array.
The estimated number of times that each load and store
reference will be executed is also collected. The use of the
same array by different behavioral nodes increases the
execution latency and the number of resources needed for this
partition (footnote 4).

3 The model has been chosen because it also exposes loop and task
level parallelism.

TempPart uses three types of estimations: (1) number
of XPP resource units needed by the configuration
implementing a single or a set of behavior nodes; (2) latency for
a behavior node or a set of connected behavior nodes on the
HTG+ (this does not need to be accurate to the real
execution time and only needs to have relative accuracy); (3)
number of clock cycles to fetch and configure each
partition (calculated based on the number of configuration words
needed, which is computed with the estimation of the
resources needed directly from the SUIF representation or
with the number of edges, ALUs, REGs, and pre-defined
values existing in the NML graph generated by MODGen).




Figure 8: Top level of the HTG+ for the DCT example (this
top level consists of 4 loops). Circles and boxes represent
behavioral and array nodes respectively. Data is read from
an input port (Loop1) and written to an output port (Loop4).



4 E.g., twice the number of references to the same RAM leads to more
than twice the number of objects required on XPP and delays each access
because of the objects needed to MERGE and DEMUX data and address
packets. Hence, combining several behavioral nodes in one partition incurs
an overhead which is computed during the temporal partitioning algorithm.



The temporal partitioning algorithm starts with a
partition for each node on the top of the HTG+ and then
iteratively merges adjacent partitions until no performance gains
are achieved, considering the maximum available size for
each partition. Each partition must currently define, on the
control flow graph (CFG) of the program, regions of code
with all entries to the same instruction and possibly multiple
exits. The algorithm considers the overlapping of
configuration and execution with fetch during the merging of
partitions. The algorithm starts with the granularity of the nodes
in the HTG+ and only if a block node cannot be mapped
does it consider partitioning at the statement or sub-block level.
Thus, the granularity of the algorithm adapts according to
the application needs.

The temporal partitioning strategy only exploits configu- 
ration boundaries inside loop bodies if an entire loop cannot 
be mapped to the XPP or contains more than one inner loop 
in the same level of the loop body. If these cases occur, the 
algorithm is applied hierarchically to the body of the loop. 
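
The merging step can be pictured with a deliberately simplified C sketch. It is not the TempPart algorithm: the function merge_partitions is hypothetical, partitions are reduced to a resource count, and every merge that fits the size limit is assumed to be a gain because it saves one reconfiguration. The real cost model, which weighs fetch, configuration and execution overlap as well as the overhead of shared arrays, is not modeled here.

```c
#include <assert.h>
#include <stddef.h>

/* Greedy merge of adjacent partitions, in place. size[i] is the
   estimated resource count of partition i. Two neighbors are merged
   whenever the combined size still fits max_size (simplifying
   assumption: every such merge is a gain). Returns the new number
   of partitions. */
size_t merge_partitions(int *size, size_t n, int max_size) {
    assert(size != NULL);
    int merged = 1;
    while (merged && n > 1) {
        merged = 0;
        for (size_t i = 0; i + 1 < n; i++) {
            if (size[i] + size[i + 1] <= max_size) {
                size[i] += size[i + 1];
                /* close the gap left by the absorbed partition */
                for (size_t j = i + 1; j + 1 < n; j++)
                    size[j] = size[j + 1];
                n--;
                merged = 1;
                break;  /* restart the scan with the new neighbors */
            }
        }
    }
    return n;
}
```

With four initial partitions of sizes 10, 20, 30 and 40 and a limit of 64 resources, the sketch ends with two partitions of sizes 60 and 40, since merging those two would exceed the limit.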

[Figure 9 flowchart: the SUIF representation annotated with
data dependences feeds HTG+ generation, followed by the temporal
partitioning algorithm (inputs: XPP parameters and alpha, initially
alpha = 0, with 0 <= alpha < 1; each TPi must satisfy
size(TPi) <= MaxSize(1 - alpha)), Estimation (MODGen), and
CheckSize (XMAP). A failed check relaxes alpha and returns to the
partitioning step; otherwise the flow is done, yielding the TPi and
the associated HTG+ nodes.]

Figure 9: Automatic temporal partitioning methodology. 

Fig. 9 shows the methodology, which uses three levels
(the computational effort increases from the first to the
third level): (1) Temporal partitioning algorithm based on
the estimation of the needed resources, done with cost
functions based on the number and kind of operations in
the source code. The algorithm uses the HTG+ and the
SUIF representation of the program; (2) For each
configuration selected in the first level, the estimated sizes are
checked against the ones estimated by generating the NML
graph with MODGen. If the size surpasses the available
resources, the algorithm reruns level (1), relaxing the size
constraint (diminishing the maximum number of
available resources); (3) Check if each configuration
successfully checked in level (2) can really be mapped to the XPP.
This level uses functions of the mapper, placer and router.
If the configuration cannot be implemented in the XPP, the
algorithm returns to level (1), once more relaxing the size
constraint. The size constraint is relaxed by reducing the
alpha parameter in each backward iteration (see Fig. 9).

After exposing the configurations with TempPart, the
compiler introduces the statements needed to communicate
scalar variables between partitions (see Fig. 10). Arrays are
used as inter-partition storage for scalar variables too, since
only RAMs (to which the arrays are mapped) keep their data
during reconfiguration. TempPart also ensures that arrays
used by more than one configuration, or by the same
configuration loaded more than once onto the XPP, are bound to
the same memory location and that this location is not used by
other arrays during the lifetime of the array variable. The
assignment of all arrays (those initially used in the source
code plus the ones added to communicate data) to the
internal memories is based on the lifetimes of the arrays
determined by the sequence of configurations that were
previously exposed in the input program. This permits, in some
cases, the use of fewer internal memories since they can be
time-shared among different configurations.

                          int comm[1];

...                       ...                   #1

a = b * c;                a = b * c;            #1

XPP_next_conf();          comm[0] = a;          #1

d = a/e;                  a = comm[0];          #2

                          d = a/e;              #2

a)                        b)                    c)



Figure 10: Example illustrating the communication of the
value of a scalar variable between two configurations: a)
source code; b) code with statements inserted to buffer the
data; c) configuration ID for each of the statements in b).

4.4 Generating the NML Application 

Each partition is input to MODGen, which generates the
NML structure to be mapped to the XPP. MODGen
generates, for each exit point existing in each partition, an event
connected to one of the CM ports available in the XPP
(the CM can check if an event is generated and can
proceed with different configurations based on the value of the
event). The compiler generates both the NML
representation of each partition and the NML section specifying the
control flow of configurations. Such control flow is
orchestrated by the CM of the XPP during runtime, as has been
already explained.

The compiler also generates NML code considering the
pre-fetch (load of a configuration to the cache of the XPP)
of configurations. The compiler can furnish two different
strategies: (1) request the pre-fetch of all configurations
existing in the application at the start of the execution; (2)
request in each configuration the pre-fetch of the next.
The request is done before the start of the configuration
step for the current configuration. Strategy (1) is used most
of the time. However, there are cases where using (2) is
better. In the presence of several nested if-then-else
structures with different configurations for each branch, a
pre-fetch sequence defined at compile time can introduce
too much overhead.



5 Experimental Results

Tab. 1 shows some results obtained when compiling a set
of benchmarks with XPP-VC. Note that none of the
examples shown was specially coded to exploit the
architectural features of the XPP more efficiently (e.g.,
partitioning and distribution of arrays among the internal
memories) and thus the results can be further improved. An XPP
Core with a single PAC was used. The 2nd column
represents the size of the PAC (number of columns and rows of
PAEs) used for each example. Columns #cf, #PAE, #Lat,
and #max represent the number of configurations, the
number of PAEs used (both the number of PAEs of the largest
configuration and the total number of PAEs virtually needed
are shown), the overall latency (taking into account
setup, fetching, configuration, data communication and
execution), and the maximum number of objects executing per
cycle respectively. The last column shows the CPU time
(using a Pentium III @ 933 MHz with Linux) to compile
each example (from the source program to the generation
of the binary configuration file).

DCT1 is an 8x8 discrete cosine transform
implementation which is based on two matrix multiplications. The
algorithm uses 6 loops for the multiplications and 2 loops to
stream I/O data. It is purely sequential (no unrolling is
used). Temporal partitioning improves the overall latency
of DCT1 by 13% and uses 31 PAEs (without partitioning 51
PAEs are used). Thus it can use a smaller XPP core. DCT2
uses the DCT kernel of DCT1 and traverses an input
image of a pre-defined size (16x16 is used). It uses 2 external
memories to load/store the image and 2 internal RAMs for
intermediate results and to store the coefficients. The
version with 6 configurations was obtained by performing
temporal partitioning. Since the example has two outer loops, the
scheme for partitioning loops was applied (the compiler uses
one configuration boundary between the two main loops of
the DCT kernel). With this scheme, a gain of 4% in
performance was achieved using 30% fewer PAEs. Chen is a
pointer-free version, with 180 lines of C, of a DCT
implementation used in JPEG. Temporal partitioning furnished
an improved version: a 66% gain in performance using 12% fewer
PAEs. The computation and data communication are
performed in 688 clock cycles. smooth represents an image
filter (16x16 image). The two inner loops (3x3 window) were
annotated to be unrolled, which led to an efficient
vectorization. An overall speedup of 4 (8.6 considering only
execution time) over the implementation obtained without
unrolling is obtained. Additionally, 2 fewer PAEs are used
with unrolling. Haar is an implementation of the forward
2D Haar wavelet transform. An input image of 16x16 is
used. A performance gain of 36% is achieved when
temporal partitioning is applied. FIR is a 1D FIR filter with 12
taps filtering 2048 samples. Even with all the overheads,
0.42 samples/cycle are computed (0.87, considering only the
latency to communicate data to external memories and the
FIR computation).

Each one of the examples was compiled in less than 5
seconds. This reveals that it is possible to have runtimes
comparable to the ones achieved by software compilation.
Performance gains obtained with temporal partitioning are
shown. Since we ran most examples with small data and
image sequences, the configuration overhead is significant.
Note that the current methodology uses neither
the full potential of the XPP nor some optimizations:
(1) The execution of a partition only starts after the full
configuration of its resources; (2) No pipelining between fetch
and configuration for the same partition has been used; (3)
The capacity of the XPP to configure distinct PACs
concurrently was not used; (4) An arbitrary order for prefetching
of conditionally requested configurations is used (the order
should be based on the most frequently taken path, e.g.,
determined by profiling); (5) The configuration FIFOs in each
array object were not used. Hence, the performance results
can be further improved.



6 Related Work 

The XPP technology offers a promising reconfigurable
computing platform. Being a step forward in the context of
reconfigurable computing, it makes it possible to address some
of the well-known deficiencies of related technologies. The
following sub-sections present the most closely related work
and reveal the most important differences.

6.1 High-Level Compilation 

The work on compiling high-level descriptions onto
reconfigurable logic has been the focus of many researchers
since the first simple attempts [7]. Most of this work targets
FPGA devices and thus needs logic synthesis, even when
module generators are included in the compilation flow, as
is the case with the MARGE [8] compiler. In addition, such
approaches also need backend mapping, place and route,
which are very time consuming with FPGA technology.
Even when pre-placed and pre-routed components are used
to assist the compilation flow, the compilation time is still
in the order of minutes or hours.

New approaches have been used which target research
architectures. One of those approaches is the Garp-C
compiler [9]. Although it is used for a reconfigware/software
architecture, the configuration bit stream generation, based
on exploitation of instruction-level parallelism beyond
basic blocks and assisted by fast mapping and placement
tasks, makes it possible to target fine-grain reconfigurable
architectures efficiently with short compilation times.

Like Garp-C and MARGE, XPP-VC also uses the SUIF
compiler front-end. The generation of the hardware
structure to be mapped to the XPP is assisted by the pipeline
vectorization ideas presented in [6]. However, the
generation of the control structure, based on the event packets
of the XPP, is completely new. Since the XPP is a
coarse-grained architecture which directly supports arithmetic and
other operations occurring in high-level languages, there is
no need for complex synthesis and mapping. The control
structure is also directly mapped to objects handling events.

6.2 High-Level Temporal Partitioning

Temporal partitioning at the behavioral level has already
been successfully conducted for FPGAs and other types of
RPUs. The majority of the current approaches try to use a
minimum number of configurations by using all the RPU
area available for each temporal partition (see, for
instance, [10]). Such schemes only consider another partition
after the current one has filled the available resources
and are insensitive to the optimization that must be applied
to reduce the overall execution time by overlapping the
fetching, configuration and execution steps. Even without
considering such optimizations, the ILP formulations presented
by some authors [11] are incapable of dealing with the
complexity of many realistic applications.

One of the first attempts to reduce the configuration
overhead in the context of temporal partitioning has been
presented in [12]. However, the approach uses the simple
model of splitting the available FPGA resources into two
parts and performing temporal partitioning using half of the
total available area as the size constraint. The scheme only
overlaps configuration with execution of adjoining partitions
and does not take into account the prefetch steps that
can be efficiently used in some RPU architectures.
Furthermore, the approach causes problems when some resources
of the RPU must be shared by two or more partitions. This
contradicts the requirement of disjoint spaces of the RPU
used by two adjacent temporal partitions.

The temporal partitioning algorithm used in the XPP-VC
compiler is based on some ideas presented in [13]. The
special characteristics of that algorithm for dealing with
resource sharing during the creation of the partitions are not
used, and special heuristics have been added to deal with the
fetch and configuration time of each partition. The proposed
partitioning of loops is introduced for the first time in this
paper. The scheme can deal with any type of loop. Previous
approaches consider loop distribution when a loop does not
fit onto the RPU [14]. However, loops which cannot be
entirely mapped onto a single configuration and which cannot
be distributed are not compiled. Our method can deal with
programs of any complexity as long as the supported C subset
is used. It does not depend on the feasibility
of a specific compiler transformation.



7 Conclusions and Future Work 

This paper describes the new Vectorizing C Compiler,
XPP-VC, which maps programs in a C subset extended by
port access functions to PACT's XPP architecture. Assisted
by a fast place and route tool, it furnishes a complete
"push-button" path from algorithmic descriptions onto XPP
configuration data with short compilation times.

An innovative temporal partitioning scheme is presented.
It enables the mapping of complex programs and
furnishes XPP applications with performance gains by hiding
some of the configuration time. A new mechanism to
handle partitioning of loops, which supports loop execution
by the configuration manager of the XPP, is also presented.
Furthermore, the compiler generates self-contained
configuration data even when several configurations are exposed.



Ongoing work focuses on tuning the estimation steps to
assist automatic temporal partitioning and on improving the
configuration data generated.

In addition to loop unrolling, loop merging, loop
distribution and loop tiling will be used to improve loop
handling, i.e., to enable more parallelism or better XPP usage. A
future extension of the compiler for a host-XPP hybrid system
is planned. The compiler will map suitable program
parts, especially inner loops, to the XPP, and the rest of the
program to the host processor.

(2) can have more impact if the map, place, and route phase confines, as far
as possible, temporally adjacent configurations to distinct locations of
the XPP (increasing the concurrency between execution and configuration).

For reconfigurable computing platforms where the reconfiguration of the
array takes several clock cycles, it is in most cases preferable that a
configuration is reused as much as possible in order to reduce the
reconfiguration overhead. Thus, loops in the source code are always good
candidates to be entirely implemented in a single configuration.

2 Specification of Configurations 

Configurations can be specified by the programmer using XPP_next_conf()
statements in the source code of a given application. Such statements must
expose, on the control flow graph (CFG) of the procedure, regions of code with
all entries to the same instruction and possibly multiple exits. The compiler
exposes the configurations, removes such statements from the SUIF1 [2]
intermediate representation, checks for invalid specifications of configuration
boundaries (when the statements expose regions with entries to different
statements in a region of code, or when code can be contained in more than one
region 1), inserts the code responsible for the data communication between
temporal partitions, and generates both the NML (Native Mapping Language) [8]
representation of each configuration and the application section specifying the
control flow of configurations. Such control flow is orchestrated by the CM
(Configuration Manager) of the XPP during runtime.

Consider a pointer-free version of the quantizer of an H.263 implementation
[3], whose code is shown in Fig. 1. Four XPP_next_conf() statements were
inserted in the code to specify three configurations. The configurations
specified are represented in the CFG of the example, which can be seen in
Fig. 2. Apart from specifying temporal partitions in such a way that the
mapping to the XPP is accomplished, there can be cases where merging only the
most frequently taken conditional paths into the same configuration reduces
the total execution time by substantially reducing the reconfiguration time
(since the partitions for the other paths are not configured when they are not
taken). Fig. 2 presents such a case. If the path bb_0, bb_1 and bb_2 is
identified as the most frequently executed, such a path can be specified to be
in the same configuration 2. In such a case, the configurations related to
bb_3 and bb_4 will only be called when the most frequent path has not been
taken. In some examples, paths are only executed in "debug mode" (as is the
case of the branch taken when QP evaluates to false in the source code of
Fig. 1).

After exposing the configurations, the temporal partitioning phase
introduces the statements needed to communicate scalar variables between two
different configurations (see Fig. 3). Currently, the scalar variables are stored in

1 Tail duplication could be applied in some examples.

2 An alternative is to have a configuration with {bb_0, bb_1, bb_2,
bb_5}; another one with {bb_4, bb_5}; and another one with {bb_3, bb_5}.

Joao M P Cardoso, PACT Informationstechnologie GmbH,
http://www.pactcorp.com, December 18th, 2001

if (QP) {
  if (Mode == MODE_INTRA || Mode == MODE_INTRA_Q) { /* Intra */
    qcoeff[0] = mmax(1, mmin(254, coeff[0]/8));
    for (i = 1; i < M; i++) {
      level = (abs(coeff[i])) / (2*QP);
      qcoeff[i] = mmin(127, mmax(-127, sign(coeff[i]) * level));
    }
    XPP_next_conf();
  } else { /* non Intra */
    XPP_next_conf();
    for (i = 0; i < M; i++) {
      level = (abs(coeff[i]) - QP/2) / (2*QP);
      qcoeff[i] = mmin(127, mmax(-127, sign(coeff[i]) * level));
    }
    XPP_next_conf();
  }
} else {
  XPP_next_conf();
  /* No quantizing. */
  for (i = 0; i < M; i++) {
    qcoeff[i] = coeff[i];
  }
  XPP_next_conf();
}

Figure 1: C source code of the quantization algorithm with configuration
boundaries specified.

Figure 2: CFG of the algorithm shown in Fig. 1. The lines crossing edges
represent the XPP_next_conf() statements in the code. The bubbles containing
basic blocks of the CFG represent the exposed regions of the CFG that are
implemented in different temporal partitions.



Figure 3: Example illustrating the communication of the value of a scalar
variable between two configurations: a) source code; b) source code with
statements inserted to buffer the data; c) configuration ID for each of the
statements in b).

arrays specially inserted in the SUIF1 representation of the given application.
Those arrays are mapped to internal memories of the XPP 3. The temporal
partitioning phase also ensures that arrays used by more than one configuration,
or by the same configuration loaded more than once to the XPP, are bound to
the same memory location and that such a location is not used by other arrays
during the lifetime of the array variable 4.

The assignment of all existing arrays (those initially used in the source
code plus those added to communicate data) to the internal memories is
based on the lifetimes of the arrays, determined by the sequence of
configurations previously exposed in the application. In some cases this
reduces the number of internal memories needed, by time-sharing some internal
memories among different configurations during the execution of the
application on the XPP.
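
The lifetime-based assignment described above can be illustrated with a greedy interval allocation. The following C fragment is only a sketch under stated assumptions and not the XPP-VC implementation: the ArrayLife type, the assign_memories function, the fixed bound of 64 memories, and the representation of a lifetime as the first and last configuration index (both non-negative) in which an array is live are all hypothetical.

```c
#include <stddef.h>

/* Hypothetical lifetime record: an array is live from configuration
   index `first` to configuration index `last` (inclusive, both >= 0). */
typedef struct {
    int first;
    int last;
    int memory;   /* assigned internal memory, or -1 if none was free */
} ArrayLife;

/* Greedy assignment; `arrays` must be sorted by `first`. Each array takes
   the lowest-numbered memory whose previous occupant is no longer live.
   Returns the number of memories used, or -1 if `num_memories` is too small. */
int assign_memories(ArrayLife *arrays, size_t n, int num_memories)
{
    int free_after[64];   /* last configuration index occupying each memory */
    int used = 0;

    if (num_memories > 64) num_memories = 64;   /* sketch-only upper bound */
    for (int m = 0; m < num_memories; m++) free_after[m] = -1;

    for (size_t a = 0; a < n; a++) {
        arrays[a].memory = -1;
        for (int m = 0; m < num_memories; m++) {
            if (free_after[m] < arrays[a].first) {   /* lifetimes do not overlap */
                arrays[a].memory = m;
                free_after[m] = arrays[a].last;
                if (m + 1 > used) used = m + 1;
                break;
            }
        }
        if (arrays[a].memory < 0) return -1;
    }
    return used;
}
```

With three arrays whose lifetimes are configurations 0 to 1, 0 to 2, and 2 to 3, the sketch needs only two memories, because the third array can reuse the memory released by the first.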

The XPP-VC compiler generates, for each exit point existing in each
configuration, an event connected to one of the CM ports available in the XPP
(the CM can check if an event is generated and can proceed with different
configurations based on the value of the event). The generated event has value
"0" if the path that activates that exit is taken and "1" otherwise.

3 Mapping Applications with Configuration Boundaries in Loop Bodies

To handle configuration boundaries in loop bodies, one can either perform loop
distribution (as long as it can be applied) or temporally partition the loop and

3 In this case, the internal memories are used as data buffers for the
maintenance of the original program behavior.

4 At the moment each configuration must use a number of array variables, to be
assigned to internal memories of the XPP, less than or equal to the number of
internal memories of the XPP (the compiler assigns each array to a distinct
memory). However, the total number of arrays existing in the overall
configurations can surpass the number of internal memories, as some memories
can be shared between configurations due to the non-overlap of the lifetimes
of array variables. Data stored in memories is maintained across
reconfigurations.



a)                      b)                   c)

int i;                  int i;
...                     i=0;                 conf #1
for(i=0;i<N;i++) {      labl: if(i<N) {      conf #2
  statement 1;            statement 1;       conf #2
  XPP_next_conf();        statement 2;       conf #3
  statement 2;            i++;               conf #3
}                         goto labl;         conf #3
statement 3;            }                    conf #3
                        statement 3;         conf #4

Figure 4: Example of the transformation applied to loops with configuration
boundaries in their bodies: a) original source code; b) transformed code; c)
configuration ID for each statement in b).

use the CM to orchestrate the control flow. 

Currently, loop distribution [11] is not automatically applied 5. All loops
with configuration boundaries specified in their bodies are transformed into
if() goto label loops in order to permit the NML generation by the XPP-VC
compiler. Fig. 4 shows an example of such a transformation without the
statements needed to communicate the value of scalar variables between
configurations. The third column shows the configuration ID of each statement.
Each configuration requests the next configuration to be taken (if the exit
taken leads to the end, then only the "reconf" 6 of the configuration is done).
Conf #2 needs a conditional request mechanism to call conf #3 or conf #4 based
on the value of the i<N expression. Since conf #3 always requests conf #2 at
the end of its execution, the initial behavior of the loop is maintained. The
temporal partitioning task also creates two more configuration boundaries to
preserve the initial functionality. From Fig. 4b) it can be seen that
configuration boundaries were inserted before and after the if statement. Such
boundaries are needed since the code before and after will be executed once,
while the if header and body will iterate N+1 and N times, respectively.
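
The request chain between the four configurations can be pictured as a small state machine. The C sketch below is purely illustrative: the ConfTrace type and the run_partitioned_loop function are hypothetical names, the statements themselves are omitted, and placing the increment of i in conf #3 is an assumption about the transformed code. It only counts how often each configuration would be loaded.

```c
/* Counts how often each configuration runs when the CM orchestrates the
   loop. Index 0 is unused; indices 1..4 stand for conf #1..#4. */
typedef struct { int runs[5]; } ConfTrace;

ConfTrace run_partitioned_loop(int n)
{
    ConfTrace t = { {0, 0, 0, 0, 0} };
    int i = 0;
    int conf = 1;              /* the CM starts with conf #1 */

    while (conf != 0) {
        t.runs[conf]++;
        switch (conf) {
        case 1:                /* code before the loop: i = 0 */
            i = 0;
            conf = 2;
            break;
        case 2:                /* if header (and statement 1 when taken) */
            conf = (i < n) ? 3 : 4;   /* conditional request */
            break;
        case 3:                /* statement 2, increment, goto labl */
            i++;
            conf = 2;          /* always requests conf #2 again */
            break;
        default:               /* conf #4: code after the loop, then end */
            conf = 0;
            break;
        }
    }
    return t;
}
```

For a loop of n iterations the sketch loads conf #2 n+1 times and conf #3 n times, while conf #1 and conf #4 are each loaded once, which matches the iteration counts given in the text.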

The configuration boundaries inserted in loop bodies must specify, at the 
scope of the loop body, the permitted type of regions (already explained). 

Loop distribution (also known as "loop fission") will be the preferable way to
implement loops whose generated NML does not entirely fit in the available
resources of the XPP. Such a transformation can potentially lead to the
introduction of temporary arrays. Consider the loop shown in Fig. 5, where a
configuration boundary is specified. The loop can be split so that each of the
two statements is in its own loop and the configuration boundary is now outside
any loop body. However, we need to scalar expand variable s in order to maintain the

5 The compiler should check if loop distribution can be applied at each
temporal partition boundary existing in loop bodies.

6 A reconf means that the resources used by that configuration are released
and can then be reconfigured.

for(i=0; i<N; i++) {
  s = ...          // statement 1
  XPP_next_conf();
  ... = s          // statement 2
}

...
for(i=0; i<N; i++) {
  tmp_s[i] = ...   // statement 1
}
XPP_next_conf();
for(i=0; i<N; i++) {
  ... = tmp_s[i]   // statement 2
}

Figure 5: Applying loop distribution as another way to enable temporal
partitioning on loop bodies.



initial functionality (one array with the size of the number of iterations of
the loop must be declared and is used to communicate each s value in each of
the loop iterations).
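
A compilable version of this transformation may help to see why the temporary array preserves the behavior. The sketch below is an illustration only: the function names fused and distributed and the concrete computations chosen for statement 1 and statement 2 are assumptions, since the original figure elides them.

```c
#include <stddef.h>

/* Original loop: the scalar s carries a value from statement 1 to statement 2. */
void fused(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int s = in[i] * 2;      /* statement 1 (hypothetical computation) */
        out[i] = s + 1;         /* statement 2 (hypothetical computation) */
    }
}

/* Distributed loops: s is scalar expanded into tmp_s, one element per iteration. */
void distributed(const int *in, int *out, int *tmp_s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        tmp_s[i] = in[i] * 2;   /* statement 1 */
    /* the configuration boundary would sit here, outside any loop body */
    for (size_t i = 0; i < n; i++)
        out[i] = tmp_s[i] + 1;  /* statement 2 */
}
```

Both functions write the same values to out; the price of the distributed form is the extra array tmp_s with one element per loop iteration.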



4 Execution Strategies 

To reduce the overall latency, the pipelining of the 3 steps present in each
temporal partition (fetching, configuring, and array execution) must be
exploited efficiently.

Two "reconf" modes can be used (the user can select one of the modes in the
options of the XPP-VC compiler related to temporal partitioning):

• "reconf" executed by the CM. In this case each configuration communicates
with the CM by sending an event, upon completion of execution,
to request the next configuration. This next configuration starts by
executing a "reconf" command to an XPP resource of the configuration (that
command is broadcast throughout all the resources used by the
configuration, so the resources are released and can be reconfigured by
the next configuration). When a configuration can be requested by more
than one previous configuration, special configurations are inserted in the
Temporal Partition Control Flow Graph (TP CFG 7) between each source
and the sink. Such special configurations only command the "reconf" of
a resource in the XPP of the previous configuration and request the next
one. This type of "reconf" does not permit overlapping of
execution and configuration between temporal partitions;

• "reconf" self-applied by each configuration. In this case each configuration,
at the end of its execution, broadcasts a "reconf" event to all the XPP
resources pertaining to it. This mode does not need the addition of special
configurations by the compiler and permits the CM to try to configure
the next configuration called during the execution of the caller temporal
partition (when only one configuration path is present).

7 The TP CFG is a directed, possibly cyclic, graph where each node represents a
configuration and each edge between two nodes specifies the execution flow of
the application through its temporal partitions. There is only one edge between
two nodes of the graph and each node represents a region of the CFG of the
application.

The compiler also generates NML code considering the pre-fetch (load of a
configuration to the cache of the XPP) of configurations. When pre-fetch is
enabled, two strategies can also be automatically used:

• request the pre-fetch of all configurations existing in the application at
the start of the execution (during this pre-fetching the flow of configuration
and execution proceeds as specified in the application section
of the NML file);

• request, in each configuration, the pre-fetch of the next one. The request is
made before the start of the configuration step for the current configuration.
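
The effect of pre-fetching on the overall latency can be estimated with a simple timing model. The C sketch below is a rough illustration under assumptions that are not taken from the source: the PartCost type and both function names are hypothetical, fetches are assumed to run back to back from the start of the application, and the configuration of a partition is assumed to wait both for its own fetch and for the end of the execution of the previous partition.

```c
#include <stddef.h>

/* Hypothetical per-partition costs in clock cycles. */
typedef struct { long fetch, config, exec; } PartCost;

static long max_long(long a, long b) { return a > b ? a : b; }

/* No overlap: each partition is fetched, configured, and executed in turn. */
long latency_sequential(const PartCost *p, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += p[i].fetch + p[i].config + p[i].exec;
    return total;
}

/* Pre-fetch of all partitions requested at the start: fetches run back to
   back and overlap with the configuration and execution of earlier
   partitions. Configuration of a partition waits for its own fetch and for
   the previous partition to finish executing. */
long latency_prefetch_all(const PartCost *p, size_t n)
{
    long fetch_done = 0;
    long exec_done = 0;
    for (size_t i = 0; i < n; i++) {
        fetch_done += p[i].fetch;
        long config_start = max_long(fetch_done, exec_done);
        exec_done = config_start + p[i].config + p[i].exec;
    }
    return exec_done;
}
```

Under this model only the fetch of the first partition is fully exposed; later fetches add latency only when they take longer than the work of the partitions that precede them.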

The CM of the XPP also permits speculative configuration of a temporal
partition, which can lead to better performance results even when the Map,
Place and Route step does not try to locate temporal partitions in
non-overlapping areas of the XPP. The strategy tries to configure the
speculatively selected partition after the configuration of the current one. If
the path which includes that configuration is taken, the CM only has to enable
the start of the execution of the configuration (see the section of NML code in
Fig. 6 and the simulation results in Fig. 7, where conf_MOD2 is speculatively
configured during the execution of conf_MOD0). When such a path is not taken,
the CM releases the resources already configured and requests the other
configuration.

5 Automatic Temporal Partitioning 

Automatic temporal partitioning permits the automatic exposing of
configurations guided by two distinct goals:

• minimum number of configurations: this goal can be achieved with
algorithms that try to use all the available reconfigurable processing units
during the assignment of segments of behavioral code to the same
configuration;

• minimum overall latency: this goal can be achieved by considering the costs
to load into the cache, to configure, and to execute each configuration on
the XPP array. An important strategy that must be considered is the
pre-fetch of configurations while another one is running. Arrays
of constants or with pre-defined values used in one or more configurations
can be initialized in one of the previous configurations, if one exists.
This takes advantage of the initialization of the array carried out by using
the configuration bus.

CONFIG conf_MOD0 {
  CONF_MODULE(MOD0)        // request the configuration of MOD0
  SET(MOD0.Reconf.E = 1)   // enabling the self releasing of resources
  REQUEST(conf_MOD2_spec)  // speculative configuration
  // if (MOD0.CMPort0 = "0") then conf_MOD2_exec is requested
  // else continue
  // if (MOD0.CMPort1 = "0") then conf_MOD1 is requested
  CONF_CMPORT(MOD0.CMPort0, conf_MOD2_exec, _)  // take MOD2 or continue
  CONF_CMPORT(MOD0.CMPort1, conf_MOD1, _)       // take MOD1
}

CONFIG conf_MOD2_spec {
  CONF_MODULE(MOD2)        // request the configuration of MOD2
}

CONFIG conf_MOD2_exec {    // MOD2 is taken
  SET(MOD2.Start.A = 1)    // enable the start of computing of MOD2
  SET(MOD2.Reconf.E = 1)   // enable the self release of resources
  REQUEST(conf_MOD3)       // request the next configuration
}

CONFIG conf_MOD1 {         // MOD1 is taken
  REQUEST(conf_MOD2_rec)   // request the releasing of resources
  CONF_MODULE(MOD1)        // request the MOD1
  SET(MOD1.Reconf.E = 1)   // enable the self releasing of resources
  REQUEST(conf_MOD3)       // request the next configuration
}

CONFIG conf_MOD2_rec {
  RECONF(MOD1.Start)       // release the resources of MOD1
}

Figure 6: Example of a section of NML code describing the speculative
configuration concept.

Figure 7: Example of the overlapping among the fetch, configuration, and
execution steps of different temporal partitions.



From the SUIF1 representation of the C source code, the temporal partitioning
phase constructs an extended HTG (Hierarchical Task Graph) 8, the HTG+. Such
an extended graph has 2 types of nodes:

1. behavioral nodes representing source lines of code in the input program;

2. array nodes representing each array existing in the source code.

Type (1) nodes have 3 distinct sub-types:

1. block nodes representing basic blocks with one entry and a single exit;

2. compound nodes representing if-then-else structures;

3. loop nodes representing the loops (for, while, etc.). Loop and compound
nodes explicitly embody hierarchical levels.

Edges in the HTG+ represent data communication between two nodes or just
enforce execution precedence.

Each behavioral node of the HTG+ is labeled with the following information
(some of the labelling steps require estimation efforts):

• block and compound nodes: number of ALUs and REGs;

• loop nodes: number of iterations (unknown if unbounded), and number of
ALUs and REGs;

• array nodes: the size of the array, type of the elements, and, when they
exist, the initialization values.

8 The model has been chosen because it will also permit exploiting loop-level
and task-level parallelism.

Figure 8: Top level of the HTG+ for the DCT example (this top level consists
of 4 loops). Circles and boxes represent behavioral and array nodes,
respectively. Loop 1 reads the data-stream from an input port to an internal
memory and Loop 4 writes the data-stream generated by the DCT code (Loops 2
and 3) from an internal memory to an output port of the XPP.

Each edge between two behavioral nodes of the HTG+ is labeled with the
number of data words that must be transferred between the two nodes.

Each edge between an array and a behavioral node in the HTG+ is labeled
with the number of load and store references (... = A[i] and A[i] = ...,
respectively) in the source code represented by the behavioral node to that
particular array. The estimated number of times that each load and store
reference will be executed is also collected. Such information is used to
calculate the penalty when two or more behavioral nodes are merged into the
same temporal partition. Such a penalty is related to the use of the same array
by different behavioral nodes and adds an overhead to the execution latency of
that temporal partition and to the number of resources needed for its
implementation.
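
The labels attached to the HTG+ can be pictured as plain records. The C declarations below are a hypothetical sketch of such a data structure and not the one used inside XPP-VC: all type, field, and function names are assumptions, and initialization values and the estimated execution counts are left out for brevity. The helper counts the references made to one array by the behavioral nodes grouped into a candidate temporal partition, which is the quantity the text relates to the merging penalty.

```c
#include <stddef.h>

typedef enum { NODE_BLOCK, NODE_COMPOUND, NODE_LOOP } BehavKind;

/* Behavioral node label; iterations < 0 stands for "unknown". */
typedef struct {
    BehavKind kind;
    int alus;
    int regs;
    long iterations;     /* only meaningful for NODE_LOOP */
} BehavNode;

/* Array node label (initialization values omitted in this sketch). */
typedef struct {
    size_t size;
    size_t elem_bytes;
} ArrayNode;

/* Edge between an array node and a behavioral node. */
typedef struct {
    int array;           /* index into the array node table */
    int behav;           /* index into the behavioral node table */
    int loads, stores;   /* static reference counts in the source code */
} ArrayEdge;

/* Number of references to `array` made by the behavioral nodes flagged
   (nonzero) in `in_partition`. */
int array_refs_in_partition(const ArrayEdge *edges, size_t n_edges,
                            const int *in_partition, int array)
{
    int refs = 0;
    for (size_t e = 0; e < n_edges; e++)
        if (edges[e].array == array && in_partition[edges[e].behav])
            refs += edges[e].loads + edges[e].stores;
    return refs;
}
```

A partitioner could call the helper before and after tentatively merging a node into a partition to see how many additional references to the same array the merge would introduce.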

Fig. 8 shows the top level of the HTG+ for an implementation of the DCT
(Discrete Cosine Transform) based on matrix multiplications.

The automatic temporal partitioning phase needs 3 types of estimations: 







[Flowchart of Figure 9; recoverable labels: "Check Size (XMAP)", "relax alpha",
"freeze TPi and the associated HTG+ nodes".]

Figure 9: Automatic Temporal Partitioning methodology.



• number of XPP resource units needed by the configuration implementing
a single or a set of behavioral nodes;

• latency for a behavioral node or a set of connected behavioral nodes on the
HTG+ (this does not need to be accurate with respect to the real execution
time and only needs to have relative accuracy);

• number of clock cycles to fetch and configure each temporal partition
(calculated based on the number of configuration words needed 9).

The temporal partitioning strategy does not exploit configuration boundaries
inside loop bodies, unless the entire loop cannot be mapped to the XPP. The
generation of this type of temporal partition never produces better results
(at least when the loop behavior is ensured by the CM). The justification is
the reutilization of already configured resources that is achieved when
the entire loop is implemented by a single configuration. When a loop does not
fit in the XPP, the algorithm is applied hierarchically to the body of the loop.

9 Can be estimated by the number of edges, ALU nodes, REG nodes, and pre-defined
values existing in the hardware graph generated by the XPP-VC compiler.

Fig. 9 shows the methodology. The strategy works around 3 levels (the 
computational efforts increase from the first to the third level): 

1. Temporal partitioning algorithm based on the estimation of the needed
resources, done with cost functions based on the number and kind of
operations in the source code. The algorithm uses the HTG+ and the SUIF
representation of the program;

2. For each configuration selected in the first level, the estimated sizes are
checked against the ones estimated by generating the NML graph with the
XPP-VC compiler. If the size surpasses the available resources, the
algorithm reruns level 1, relaxing the size constraint (diminishing the
maximum number of available resources);

3. Check if each configuration successfully checked in level 2 can really be
mapped to the XPP. This level uses functions of the mapper, placer and
router. If the configuration cannot be implemented in the XPP, the
algorithm returns to level 1, once more relaxing the size constraint.
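
The interplay of the three levels can be sketched as a loop that lowers the size limit until a candidate passes both checks. The C fragment below is only an illustration: the Levels callback table, the find_size_limit function, the fixed step by which the limit is lowered, and the three toy estimators are hypothetical stand-ins for the cost functions, the NML graph generation, and the mapper, placer and router.

```c
/* Hypothetical callbacks standing in for the three levels. */
typedef struct {
    int (*estimate_size)(int limit);  /* level 1: partition under `limit`,
                                         return the estimated size */
    int (*nml_size)(int estimated);   /* level 2: size of the NML graph */
    int (*maps)(int nml_size);        /* level 3: nonzero if mapping succeeds */
} Levels;

/* Returns the size limit that passed all three levels, or -1 when the limit
   reaches zero without success. `step` is the amount by which the limit is
   lowered after each failed check. */
int find_size_limit(const Levels *lv, int available, int step)
{
    if (step <= 0) return -1;
    for (int limit = available; limit > 0; limit -= step) {
        int est = lv->estimate_size(limit);     /* level 1 */
        int nml = lv->nml_size(est);            /* level 2 */
        if (nml > available) continue;          /* too large: rerun level 1 */
        if (!lv->maps(nml)) continue;           /* level 3 failed: rerun level 1 */
        return limit;
    }
    return -1;
}

/* Toy estimators, used only to exercise the loop. */
static int est_identity(int limit) { return limit; }
static int nml_overhead(int est)   { return est + est / 4; }
static int maps_below_90(int nml)  { return nml <= 90; }
```

With 100 available resource units, a step of 10, and the toy estimators, the limits 100 and 90 fail the level 2 check, 80 fails the level 3 check, and 70 is accepted.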

The temporal partitioning algorithm used is based on the ideas presented in
[7]. The special characteristics of the algorithm to deal with resource sharing
during the creation of the temporal partitions have been removed, and special
heuristics have been added to deal with the fetch and configuration time of each
temporal partition. The algorithm tries to overlap configuration and execution
with fetch during the selection of the HTG+ nodes for each temporal partition.

6 Discussion 

We call the reader's attention to the fact that the current methodology
uses neither the full potential of the XPP nor some optimizations:

1. The execution of a given temporal partition only starts after all the used
resources have been configured;

2. No pipelining between fetch and configuration has been used. The
configuration of the XPP resources for a specific temporal partition only
starts after its configuration words are fetched (loaded to the XPP cache);

3. No overlapping of execution between two or more configurations;

4. The capacity of the XPP technology to concurrently configure distinct
PACs (each PAC has its own CM) is not used;

5. An arbitrary order for the fetching of conditionally requested temporal
partitions is used (the fetching order should follow the most frequently
taken path, determined by profiling);

6. Behavioral nodes exposed in the HTG+ as concurrent nodes are not at the
moment implemented, by the XPP-VC compiler, with parallel execution.

Thus, we strongly believe that there is still potential to improve the
performance results achieved when using XPP-VC.



ported by tools that permit compiling algorithms in C. Being a step forward
in the context of reconfigurable computing, it makes it possible to address
some of the well-known deficiencies present in many, if not all, other
reconfigurable computing technologies. However, some of the work being done to
augment the potential of this technology builds on previous work.

Temporal partitioning has already been successfully conducted for FPGAs
and other types of RPUs. The majority of the current approaches try to use a
minimum number of configurations by using all the RPU area available
for each temporal partition (see, for instance, [4]). Such schemes only consider
another temporal partition after the current one has filled the available
resources and are insensitive to the optimization that must be applied to
reduce the overall execution time by overlapping the fetching, configuration
and execution steps. Even without considering such optimizations, the ILP
formulations presented by some authors [5] are incapable of dealing with the
complexity of many realistic examples.

One of the first attempts to reduce the configuration overhead in the context
of temporal partitioning has been presented in [6]. However, the approach uses
the simple model of splitting the available FPGA resources into two parts and
performing temporal partitioning using half of the total available area as the
size constraint. The scheme only overlaps configuration with execution of
adjoining partitions and does not take into account the pre-fetch steps that
can be efficiently used in some RPU architectures. Furthermore, the approach
can cause problems when some resources of the RPU must be shared by
two or more partitions (eliminating the requirement of disjoint spaces of the
RPU used by two adjacent temporal partitions).

[12] presents the scheduling of kernels (sub-tasks) targeting the MorphoSys
architecture. They use an efficient search pruning scheme added to a heuristic
that first considers solutions which potentially lead to the
best performance results. However, they mainly orient the search to data reuse
among the scheduled kernels, which is only suitable for the type of
reconfigurable computing architectures where no memories local to the RPU are
available. The scheduler tries to overlap computing and data transfers and to
minimize context reloading, which, as we can see from the examples shown, does
not always lead to the overall minimum latency. The scheme needs as input the
application flow graph (without concurrency and conditional paths) and the
kernel timing. The approach does not consider temporal partitioning and so
requires that each kernel configuration does not exceed the context memory size.
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1 Benefits of Temporal Partitioning 

[...] than to simple [...] reconfigurable [...] which mapping onto the RPU
(Reconfigurable Processing Unit) resources cannot be accomplished by only one
configuration. [...] targeting the XPP:

• reduction of the interconnection lengths, by reducing each design
complexity, can furnish better performance results (long interconnections are
routed through registers, thus adding clock cycle delays);

• reduction of each temporal partition complexity can itself reduce the
number of registers used for vertical routing;

• reduction of the number of references, in each temporal partition, using
the same resource, by distributing the overall references among [...], can
furnish better performance results as well. This happens
with the statements present in the program referring to the same array;

• reduction of the overall configuration overhead by overlapping fetching,
configuration and execution of distinct temporal partitions.

The reduction of the configuration overhead is due to 3 distinct sources of
overlapping, possible with the XPP architecture:

1. fetching of subsequent configurations into the cache in parallel with the
configuration of the current one;

2. execution of one configuration while the next one is being configured;

3. execution of one configuration while the next one is being loaded into the
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Abstract 

Resource virtualization on FPGA devices, achievable
due to their dynamic reconfiguration capabilities, provides
an attractive solution to save silicon area. Architectural
synthesis for dynamically reconfigurable FPGA-based
digital systems needs to consider reducing the
number of temporal partitions (reconfigurations) by
enabling the sharing of some functional units in the same
temporal partition. This paper proposes a novel algorithm
for automated datapath design, from behavioral input
descriptions (represented by a dataflow graph), which
simultaneously performs temporal partitioning and
sharing of functional units. The proposed algorithm
attempts to minimize both the number of temporal
partitions and the execution latency of the generated
solution. Temporal partitioning, resource sharing,
scheduling, and a simple form of allocation and binding
are all integrated in a single task. The algorithm is based
on heuristics and on a new concept of construction by
gradually enlarging timing slots. Results show the
efficiency and effectiveness of the algorithm when
compared to existing approaches.

1 Introduction 

The availability of multi-programmable logic devices
(such as FPGAs, field programmable gate arrays) with
lower reconfiguration times has made possible the concept
of "virtual hardware" [1][2]: the hardware resources are
supposed unlimited, and implementations that oversize the
resources available on the device are resolved by temporal
partitioning. Then, the temporally partitioned solution is
executed by time-sharing the device such that the initial
functionality is preserved. This concept promises to be an
efficient solution to save silicon area [1]. One of the
applications is the switch among functionalities that are
mutually exclusive in the temporal domain, such as the
context switching between coding/decoding schemes in
communication, video or audio systems.



Although even the latest commercial FPGAs, such as the
Xilinx Virtex family [3], do not have mechanisms to
efficiently implement temporally partitioned functionalities,
and the time to reconfigure the overall FPGA is still quite
high, the importance of the "virtual hardware" concept has
already been demonstrated with computationally complex
applications [4]. Industrial efforts are under way to further
improve the capability of the devices to handle multiple
configurations by storing several configurations on-chip and
permitting the switch between contexts in a few
nanoseconds [5].

The virtualization of FPGA resources has been considered
by several authors while dealing with circuit netlists that
oversize the available resources on the device ([6][7], just
to name a few). From the point of view of the design, those
approaches work at a much lower level of abstraction,
without the possibility to exploit tradeoffs between the
number of reconfigurations and the resource sharing of
functional units (FUs), for instance. The design automation
for FPGA-based systems should include temporal
partitioning algorithms able to efficiently exploit the new
concept. Tradeoffs among parallelism, communication costs,
execution and reconfiguration times, and sharing of some
FUs in the same reconfiguration need to be considered
during the architectural synthesis phases.

Sharing of FUs among operations is a technique to reuse
a single configuration of an FU by more than one operation
of the same type. Temporal partitioning, on the other hand,
is a technique tailored to reuse the available resources by
different circuits (configurations) by time-multiplexing the
device. The nodes of a given intermediate representation
(e.g., a dataflow graph) representing operations have to be
scheduled in time steps to be executed in each temporal
partition (TP). Temporal partitioning must preserve the
dependencies among nodes (which are already temporal
dependencies) such that a node B dependent on node A
cannot be mapped to a partition executed before the
partition where node A is mapped. In addition, considering
sharing of FUs during temporal partitioning can lead to
better overall results (lower number of TPs and better
performance).

Figure 1a) shows a design flow which integrates temporal
partitioning prior to the high-level synthesis tasks [8].
The majority, if not all, of the existing approaches utilize
the presented flow [9][10]. Our efforts address architectural
synthesis¹ integrating temporal partitioning, and this paper
presents a new temporal partitioning algorithm that
effectively takes into account sharing of FUs, while
maintaining a small computational complexity. Besides, it is
sufficiently flexible to target different FPGA devices.
Figure 1b) shows the design flow proposed in this paper,
where temporal partitioning is integrated in the high-level
synthesis tasks and is performed simultaneously.
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Figure 1. Design flow based on high-level 
synthesis for reconfigurable computing
systems: a) traditional flow; b) proposed 
flow. 



Example 1. Motivational example. 

Consider the dataflow graph shown in Figure 2 (Ex1). It
consists of 4 additions and 2 multiplications. Suppose that
each adder uses 1 cell and has a latency of 1 clock cycle,
each multiplier uses 2 cells and has a latency of 2 clock
cycles, and the maximum resources available on the device
equal 3 cells. The dataflow graph has a critical path latency
of 4 cycles and needs 8 cells given those FUs (last row of
Table I). Figure 2 shows an optimal solution (not considering
the area of multiplexers, registers and control unit needed to
implement sharing of a specific FU) for the example, with
results shown in the second row of Table I. In Figure 2 each
gray region identifies operations that are mapped to the same
FU. The optimal solution is achieved with only one adder and
one multiplier and fits entirely in a single TP. When sharing
of adders is not considered, the optimum result is shown in
the third row of Table I. The algorithm proposed in this paper
achieves those optimal results. The fourth row of the table
shows the solution obtained with a leveling temporal
partitioning algorithm that does not consider resource
sharing of FUs. From this example, it can be seen that
resource sharing can reduce the number of reconfigurations
and can also reduce the overall execution latency. There are
also cases where the critical path latency of the input
dataflow graph (last row) is maintained (second row).

¹ There is no distinction among the terms: high-level
synthesis, architectural synthesis and behavioral synthesis.




Figure 2. Dataflow graph of the example 
Ex1. 



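Table I. Results for Ex1.

Approach                                              #TPs   Execution   Resources
                                                             latency     used
----------------------------------------------------------------------------------
Optimum (sharing of adders and multipliers)           1      4           3
Optimum (sharing of multipliers)                      3      5           3
[leveling temporal partitioning, no sharing]          4                  3
Without temporal partitioning (no sharing), (4,2)            4           8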



The remainder of this paper is organized as follows.
Section 2 formulates and explains the problem. The
algorithm is explained in detail in section 3, where the
pseudo-code and the overall steps performed are illustrated
through an example. In section 4, experimental results are
shown and discussed. In section 5, related work is
described. Finally, in section 6, conclusions are presented
and further work is envisaged.

2 Problem Definition 

We are given a dataflow graph (DFG) representing a
behavioral description, $G = (V, E)$, topologically ordered,
directed and acyclic, with $|V|$ nodes
$\{v_1, v_2, \ldots, v_{|V|}\}$ and $|E|$ edges, where each
node $v_i$ represents an operation and each edge
$e_{ij} \in E$ represents a dependence between nodes $v_i$
and $v_j$. A dependence can be a simple precedence
dependence or a transport dependence due to the transport
of data between two nodes. The DFG can be obtained from
an algorithmic input description. Such a pre-processing step
is beyond the scope of this article, but the front-end of our
Java compiler for reconfigurable computing systems can
be employed [12].

Here we assume that there is a component library with a
set of FUs and that there is one FU for each type of
operation in the DFG. $\Phi$ represents the set of FUs, from
the component library, to be instantiated by the algorithm.
$R_{max}$ represents the resource capacity available on the
device, $R(\pi_i)$ returns the number of resources utilized by
the TP $\pi_i$, and $R(v_i)$ returns the number of resources
utilized by the FU instance associated with $v_i$. $N(\pi_i)$
returns the subset of nodes of $V$ mapped to $\pi_i$.

Each partition $\pi_i$ is a non-empty subset of $V$, where
for each node there exists a map to one and only one FU
instance in $\Phi$. $\pi(v_i)$ identifies the TP where node
$v_i$ is mapped. The set of the TPs is represented by:

$$P = \{\pi_1, \pi_2, \ldots, \pi_N\}$$

where $N$ represents the number of TPs. A graph $G$,
temporally partitioned in $N$ subsets (TPs), is correct if:

- $\bigcap_{i=1}^{N} N(\pi_i) = \emptyset$ (understood
pairwise, i.e. $N(\pi_i) \cap N(\pi_j) = \emptyset$ for
$i \neq j$): each node $v_i \in V$ is mapped to only one TP
(here we do not consider cloning of operations in the DFG);

- $\bigcup_{i=1}^{N} N(\pi_i) = V$: all the nodes of $V$ are
mapped;

- $\forall \pi_i \in P,\ R(\pi_i) \le R_{max}$: each TP fits
in the resources available on the device;

- $\forall e_{ij} \in E,\ \pi(v_i) \le \pi(v_j)$: the order of
the execution of the TPs does not violate the dependencies
among operations of the DFG (necessary condition to obtain
the same functionality).
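The four conditions above can be checked mechanically. The following is a minimal sketch, not the authors' implementation: the function name `is_correct_partitioning` and the data layout are illustrative assumptions, and $R(\pi_i)$ is approximated as the sum of the node sizes, i.e. one FU instance per node with no sharing.

```python
from typing import Dict, List, Set, Tuple


def is_correct_partitioning(
    nodes: Set[str],
    edges: List[Tuple[str, str]],
    tps: List[Set[str]],
    size: Dict[str, int],
    r_max: int,
) -> bool:
    """Check the correctness conditions for TPs executed in list order.

    Assumption: R(tp) is the sum of the sizes of its nodes (no FU sharing).
    """
    # Each TP must be a non-empty subset of the nodes.
    if any(not tp for tp in tps):
        return False
    # Condition 1: no node is mapped to more than one TP.
    seen: Set[str] = set()
    for tp in tps:
        if seen & tp:
            return False
        seen |= tp
    # Condition 2: every node of the graph is mapped.
    if seen != nodes:
        return False
    # Condition 3: each TP fits in the available resources.
    if any(sum(size[v] for v in tp) > r_max for tp in tps):
        return False
    # Condition 4: a node never executes in a TP before its predecessors' TP.
    tp_of = {v: i for i, tp in enumerate(tps) for v in tp}
    return all(tp_of[src] <= tp_of[dst] for src, dst in edges)


# Hypothetical three-node chain a -> b -> c, one cell per node, R_max = 2.
NODES = {"a", "b", "c"}
EDGES = [("a", "b"), ("b", "c")]
SIZE = {"a": 1, "b": 1, "c": 1}
print(is_correct_partitioning(NODES, EDGES, [{"a", "b"}, {"c"}], SIZE, 2))
```

Extending the resource check to shared FUs would require the binding of nodes to FU instances, which this sketch deliberately leaves out.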

A correct set of TPs guarantees the same overall behavior
as the original graph (when executed from 1 to $N$ and
considering a correct communication mechanism to transfer
data among TPs). However, we are also interested in the
minimization of the overall execution latency. The cost that
reflects the overall execution latency in a time-multiplexed
device can be estimated by equation (1) or (2), when partial
or full reconfiguration of the available resources is
considered, respectively. $CS(P)$ returns the minimum
execution latency (number of control steps or clock cycles)
of the partitioned solution. $CS(\pi_i)$ refers to the minimum
execution latency of the TP $\pi_i$ (it may include the
communication costs and represents the execution latency of
the critical path of the graph formed by the subset of nodes
in $\pi_i$ and the corresponding edges, considering that nodes
sharing FU instances can exist). $\partial_i$ and $\partial$
represent the number of clock cycles to reconfigure the TP
$\pi_i$ or all the available resources, respectively.

$$CS(P) = \sum_{i=1}^{N} \left( CS(\pi_i) + \partial_i \right) \qquad (1)$$
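As a rough illustration of the two cost estimates, the sketch below evaluates equation (1) and an assumed form of equation (2). Equation (2) does not survive in this text; the version used here, in which every TP pays one full reconfiguration of $\partial$ cycles, is an assumption inferred from the surrounding definitions, and both function names are hypothetical.

```python
from typing import Sequence


def latency_partial(cs: Sequence[int], reconf: Sequence[int]) -> int:
    """Equation (1): each TP adds its own latency and its own reconfiguration time."""
    if len(cs) != len(reconf):
        raise ValueError("one reconfiguration time is needed per TP")
    return sum(c + d for c, d in zip(cs, reconf))


def latency_full(cs: Sequence[int], reconf_all: int) -> int:
    """Assumed form of equation (2): every TP pays one full reconfiguration."""
    return sum(cs) + len(cs) * reconf_all


# Two TPs with latencies of 4 and 5 control steps (illustrative numbers only).
print(latency_partial([4, 5], [10, 20]))  # partial reconfiguration
print(latency_full([4, 5], 100))          # full reconfiguration
```

Under either estimate, a solution with fewer TPs is favored as soon as the reconfiguration time dominates the execution latency of the individual TPs.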

The objective of our algorithm is to furnish a set of
datapaths that will be executed in sequence with a minimum
number of control steps². Each datapath unit fits in the
physically available resources. For the sake of minimizing
the number of TPs needed, exploiting sharing of FUs while
doing temporal partitioning needs to be considered by the
algorithm. Specifically, our algorithm has to output:

- The set of TPs ($P$): each TP identifying the nodes of
the DFG assigned to it;

- The set of instances for each FU used ($\Phi$);

- Each node of the DFG has to identify a specific FU
instance of $\Phi$ implementing the operation.

From those outputs, it is straightforward to generate a 
behavioral HDL-RTL (hardware description language at 
the register transfer level) description of each TP control 
unit and a structural HDL-RTL description of each 
datapath, considering the existence of a HDL description 
for each FU. The configurations can be generated from 
those netlists using a traditional FPGA design flow. 

3 Algorithm Simultaneously Exploiting 
Temporal Partitioning and Sharing of FUs 

The algorithm uses an initial number of TPs that can be
specified by the user. Another possibility is to use, as the
initial number of TPs, the number of levels of the DFG or
the number of TPs utilized by any temporal partitioning
algorithm that does not use sharing of FUs (e.g., ASAP [11]).
The user has to specify the total number of available
resources on the device. In addition, for each FU there exists
a boolean variable whose value indicates whether the FU can
be shared (sharing of some FUs may need more resources
than the utilization of several FU instances, due to the
overhead of the auxiliary circuits needed for the
implementation of the sharing mechanism).

For clarity, we show the main steps of the algorithm in
connection with Example 1. A brief exposition of the steps
performed, when considering sharing of all FUs, is sketched
in Figure 3.

² We assume that each control/time step for scheduling is equal to the
clock period of the system. Thus, there is no distinction among the use of
clock cycle, control step or time step.
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Figure 3. Algorithm execution through an 
example: a) ASAP and ALAP start times; b)
the nodes in the critical path, identified by
the gray region; c), d), e), f) and g) show
iterations of the algorithm.



The algorithm starts with the following steps: 

1. Compute the set of child³ nodes of each node of the
DFG;

2. Map an FU instance to each operation in the DFG (at
the moment we consider neither more than one FU for the
same operation nor FUs capable of implementing more
than one operation);

3. Estimate the area and execution latency of each node in
the DFG according to the FU characterization, existing
in the component library, for the target device. This
step is beyond the scope of this article, and from now
on we will assume that there exists, for each FU, an
estimation of the number of resources and of the
execution latency;

4. Compute the ASAP (as soon as possible) and ALAP (as
late as possible) start times for each node in the DFG
(see Figure 3a)), both unconstrained. When doing the
ALAP scheme, the algorithm also calculates the ALAP
level of each node;

5. Determine the set of nodes in one of the critical paths
of the DFG (see Figure 3b));

6. Create a number of TPs equal to the input number
specified (see the three TPs initially created in Figure
3c));

7. Assign each node of the set of nodes in one of the
critical paths of the DFG (determined in point 5) to a TP by
ascending level. When the number of TPs is larger than
the number of nodes in the critical path, the last TPs
are left empty; otherwise the last nodes of the set are
left unassigned (see the nodes assigned to each TP in
Figure 3c));

8. Assign the size (number of resources used) of a node in
a TP to the current size of that TP (see Figure 3c)).

³ A node $v_j$ is a child of a node $v_i$ if there exists a path from $v_i$
to the end of the DFG that includes $v_j$.
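Step 4 above can be illustrated with a short sketch of unconstrained ASAP and ALAP start-time computation. This is a generic textbook formulation under stated assumptions, not the authors' code; the function name `asap_alap` and the example graph are hypothetical (the graph merely mimics Example 1 with 4 additions of latency 1 and 2 multiplications of latency 2; the actual edges of Figure 2 are not available here).

```python
from collections import deque
from typing import Dict, List, Tuple


def asap_alap(
    latency: Dict[str, int], edges: List[Tuple[str, str]]
) -> Tuple[Dict[str, int], Dict[str, int], int]:
    """Return unconstrained ASAP/ALAP start times and the critical path latency."""
    preds: Dict[str, List[str]] = {v: [] for v in latency}
    succs: Dict[str, List[str]] = {v: [] for v in latency}
    for src, dst in edges:
        succs[src].append(dst)
        preds[dst].append(src)

    # Topological order of the DFG (Kahn's algorithm).
    indegree = {v: len(preds[v]) for v in latency}
    queue = deque(v for v in latency if indegree[v] == 0)
    order: List[str] = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for s in succs[v]:
            indegree[s] -= 1
            if indegree[s] == 0:
                queue.append(s)
    if len(order) != len(latency):
        raise ValueError("the graph is not acyclic")

    # ASAP: a node starts as soon as all of its predecessors have finished.
    asap: Dict[str, int] = {}
    for v in order:
        asap[v] = max((asap[p] + latency[p] for p in preds[v]), default=0)
    total = max(asap[v] + latency[v] for v in order)

    # ALAP: a node starts as late as possible without delaying any successor.
    alap: Dict[str, int] = {}
    for v in reversed(order):
        alap[v] = min((alap[s] for s in succs[v]), default=total) - latency[v]
    return asap, alap, total


# Hypothetical DFG: adders a1..a4 (latency 1), multipliers m1, m2 (latency 2).
LATENCY = {"a1": 1, "a2": 1, "a3": 1, "a4": 1, "m1": 2, "m2": 2}
EDGES = [("a1", "m1"), ("m1", "a3"), ("a2", "a4"), ("m2", "a4")]
ASAP, ALAP, CRITICAL = asap_alap(LATENCY, EDGES)
print(ASAP, ALAP, CRITICAL)
```

Nodes whose ASAP and ALAP start times coincide lie on a critical path, which is one way to obtain the set used in step 5.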

After the above steps the main kernel of the algorithm 
is executed (see the pseudo-code in Figure 4, Figure 5 and 
Figure 6). Some of the most important functions used by 
the algorithm are listed and briefly explained below: 

- $v_i$.ALAPlevel(): returns the level of $v_i$ considering an
ALAP leveling scheme;

- $v_i$.ALAPStart(): returns the ALAP start time of $v_i$;

- $\pi_i$.addEl($v_i$): adds the node $v_i$ to $\pi_i$;

- $\pi_i$.rmEl($v_i$): removes $v_i$ from $\pi_i$;

- $\pi_i$.sched($v_i$): returns the number of control steps of the
critical path considering that $v_i$ is mapped to $\pi_i$;

- $P$.add($\pi_i$): adds a new TP to the current set of TPs
($\pi_i$ will be the last TP in the set);

- $P$.elAt(i): returns the i-th TP from the set of TPs ($P$);

- findNodes(i): returns a list of nodes ready to be
mapped to the i-th TP.

Our algorithm progressively constructs a global solution.
On each iteration, the algorithm traverses the sequence of
the existing TPs, trying to assign ready nodes to each TP.
Each TP has an associated maximum slot time ($MAX_{CS}$).
A node ready to be mapped to a TP is only really considered
for mapping if the resulting execution latency of that TP
(considering the mapping) does not exceed the corresponding
$MAX_{CS}$ (line 15 of Figure 4 and lines 2, 21 and 29 of
Figure 5). The $MAX_{CS}$ of a given TP $\pi_i$ is equal to
the critical path latency of that TP plus a relax amount:
$CS(\pi_i) + relax$. On each iteration over the TPs, the relax
value is incremented by the greatest common divisor (gcd)
among all the execution latencies of the operations in the
DFG (line 24 of Figure 4). When a node is mapped (see
function mapNode in Figure 6), the critical path length of the
associated TP is updated (lines 4 and 5 of Figure 6).
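The slot-enlarging mechanism described above reduces to two small computations, sketched below. The helper names `relax_step` and `within_slot` are illustrative assumptions; the second condition (`cs_current == 0`, an empty TP always accepts) follows the test that appears in the pseudo-code of Figure 5.

```python
from functools import reduce
from math import gcd
from typing import Iterable


def relax_step(latencies: Iterable[int]) -> int:
    """Increment applied to relax: the gcd of all operation latencies in the DFG."""
    return reduce(gcd, latencies)


def within_slot(cs_new: int, cs_current: int, relax: int) -> bool:
    """True if mapping a node keeps the TP within MAX_CS = CS(tp) + relax."""
    return cs_new <= cs_current + relax or cs_current == 0


# Latencies of Example 1: four adders (1 cycle) and two multipliers (2 cycles).
STEP = relax_step([1, 1, 1, 1, 2, 2])
print(STEP, within_slot(3, 2, 0), within_slot(3, 2, STEP))
```

Using the gcd as the increment means the slot never grows by an amount smaller than what any operation could actually consume.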

The algorithm considers that nodes in contiguous time
steps mapped to the same TP and with the same operation 
should be bound to the same FU instance. 

A list of nodes ready for mapping to the current TP is
used. The list has the nodes sorted by increasing ALAP
start times (the candidate operation having the lowest ALAP
value has the highest priority) and, for nodes with the same
ALAP start time, it uses the ASAP start time as a tiebreak
(in ascending or descending order). The list is determined
by examining, for a given node, its predecessors (they must
already be mapped in TPs before the current TP) and its
child set (the child nodes of the node to be mapped must be
in TPs after the TP under consideration). The incremental
update of the list of candidate nodes for the current TP,
each time a node is mapped, is an option of the algorithm
(lines 6 and 7 in Figure 6). When this option is disabled, the
algorithm only tries to update the list when it is empty. The
algorithm uses a static approach in the sense that the
ALAP/ASAP values are calculated only once and are never
updated afterwards.
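The priority order of the ready list can be expressed as a two-key sort. The sketch below is only an illustration of that ordering; the function name `sort_ready` and the flag `asap_ascending` are hypothetical, and the start times are assumed to be precomputed once, in line with the static approach described above.

```python
from typing import Dict, List


def sort_ready(
    ready: List[str],
    alap: Dict[str, int],
    asap: Dict[str, int],
    asap_ascending: bool = True,
) -> List[str]:
    """Order ready nodes by ALAP start time, breaking ties with the ASAP start time."""
    sign = 1 if asap_ascending else -1
    return sorted(ready, key=lambda v: (alap[v], sign * asap[v]))


# Illustrative start times for three ready nodes.
ALAP_START = {"x": 2, "y": 1, "z": 2}
ASAP_START = {"x": 1, "y": 0, "z": 0}
print(sort_ready(["x", "y", "z"], ALAP_START, ASAP_START))
print(sort_ready(["x", "y", "z"], ALAP_START, ASAP_START, asap_ascending=False))
```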
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// oegin main kernel 

BitSet NodesSched - marked with the nodes already mapped to TPs;
int NumTP = 0; relax = 0;
int step = gcd(All nodes in DFG);
while (notAllNodesSched(NodesSched)) {
  LOOP B: while (SchedNum < maxPartitions) {
    Vector listReady = findNodes(NumTP);
    Vector πi = P.elAt(NumTP);
    while (!listReady.isEmpty()) {
      Node vk = listReady.rmFirst();
      int Rnew = R(πi) + R(vk);
      boolean fit = (Rnew <= Rmax);
      // CS(πi) when vk is mapped to πi
      int CSnew = πi.sched(vk);
      if ((CSnew > (CS(πi) + relax)) && (πi is the last TP) && (N(πi) == 0)) {
        tryToSched(Rnew, vk, fit, CSnew - CS(πi), πi, NodesSched, update,
                   CSnew, ListReady);  // Figure 5
      } else {
        tryToSched(Rnew, vk, fit, relax, πi, NodesSched, update,
                   CSnew, ListReady);  // Figure 5
      }
      ...
  relax += step;
  ...
// end main kernel

Figure 4. Main kernel of the proposed algorithm.

tryToSched(int Rnew, Node vi, boolean fit, int relax, TP πk,
           BitSet NodesSched, boolean update, int CSnew, Vector ListReady) {
  if ((CSnew <= (relax + CS(πk))) || (CS(πk) == 0)) {
    if (there is only one node vj in πk) {
      if (vj.ALAPStart() > vi.ALAPStart()) {
        if ((Rnew - R(vj)) <= Rmax) {
          πk.rmEl(vj);
          R(πk) = R(vi);
          πk.addEl(vi);
          NodesSched.clear(vj);
          NodesSched.set(vi);
          continue LOOP B;
        }
      }
    }
    boolean canShare = try sharing with a node of the same type with a path
                       of shared FUs with the smallest length (number of nodes);
    if (canShare) {
      if (fit && share produces increase) {
        canShare = false;
        unShare(vi);
      } else {
        int CSnew = πk.sched(vi);
        if (CSnew <= (relax + CS(πk))) {
          mapNode(πk, vi, NodesSched, update, CSnew, ListReady);  // Figure 6
        } else {
          unShare(vi);
          canShare = false;
        }
      }
    }
    if (!canShare && fit && ((CSnew <= (relax + CS(πk))) || (CS(πk) == 0))) {
      mapNode(πk, vi, NodesSched, update, CSnew, ListReady);  // Figure 6
    }
    if (vi not mapped and no FU with operation type of vi in this TP
        and vi does not fit and this TP is the last TP) {
      // create a new TP
      P.add(πn);
      mapNode(πn, vi, NodesSched, update, CS(πn), ListReady);  // Figure 6
      break LOOP B;
    }
  }
} // end tryToSched

Figure 5. Function tryToSched.

mapNode(TP πk, Node vi, BitSet NodesSched,
        boolean update, int CSnew, Vector ListReady) {
  πk.addEl(vi);
  NodesSched.set(vi);
  if (CSnew > CS(πk))
    CS(πk) = CSnew;
  if (update)
    upDateAndSortALAP(ListReady, vi);
} // end mapNode

Figure 6. Function mapNode.




4.1 Sharing versus not sharing 

Table III shows results for the considered examples.
Our* and Our** identify results obtained by applying the
proposed algorithm. Our* considers resource sharing for
both adder and multiplier units, and Our** only considers
resource sharing for multiplier units. #cs identifies the
execution latency (number of clock cycles) and #tp the
number of TPs. Each solution related to our algorithm was
obtained in less than 1 s of CPU time.

Table III. Results obtained for the examples.







Approach 


Example 




ASAP 


SA 


Our* 


Our** 
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18 
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18 


35 
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34 


SEHWA 


10 
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24 
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19 
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18 
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18 
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18 
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15 
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15 


5 


15 


HAL 
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11 
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10 
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10 


1 10 
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10 
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7 
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12 


26 


12 


23 
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23 
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25 


EWF 


10 


6 


22 
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22 
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18 
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18 




! 15 
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19 
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18 
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17 
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18 
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14 


28 


14 


27 
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20 
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"27 


FIR 


10 


7 


16 


7 


15 
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*5 


4 


IS 




Example   Rmax   ASAP         SA           Our*         Our**
                 #tp   #cs    #tp   #cs    #tp   #cs    #tp   #cs
FIR       15     5     12     5     11     1     11     3     11
MAT4x4    6      72    136    72    134    1     130    21    130
MAT4x4    10     37    69     37    69     1     66     17    66
MAT4x4    15     25    47     25    46     1     46     10    46
MAT4x4    20     16    29     16    29     2     29     4     29



The SA results were obtained with a simulated annealing
version of temporal partitioning without resource sharing,
proposed in [16]. Here, the algorithm is tuned to optimize
the overall execution time (the algorithm can also exploit
the tradeoff between execution time and communication
costs). The ASAP results refer to the leveling technique
proposed in [11].

Only Mat4x4 needed to start with the number of TPs 
obtained by the ASAP approach to achieve the best solu- 
tion. For all the other examples, the best solution was ob- 
tained starting with an initial number of TPs equal to the 
number of levels of the DFG. The results for Mat4x4 in 
Table III were collected disabling the update of the list of
nodes ready for each node mapped (the list is updated only 
when it is empty). It is strongly recommended to disable 
the update option for examples with high-level degree of 
parallelism and a small critical path length. 

The values in bold in the 6th and 8th columns of Table III
show the minimum execution latency for the datapaths
obtained by the considered approaches (not considering
configuration times). The values in bold in the 10th column
indicate that, even without considering sharing of adders,
our algorithm returns solutions with execution latencies
equal to those obtained when sharing all the resources
(8th column), despite the fact that those solutions need
more TPs.

When considering resource sharing for all FUs, a
minimum number of TPs (only 4 cases of Table III needed
more than one TP to produce a minimum execution time)
seems to ensure solutions with lower execution latencies
than those obtained by doing temporal partitioning with
ASAP or SA for the majority of the examples (only one
case is not as good as SA). Note that when all the FUs can
be shared and the resource overhead to implement sharing
is not taken into account, an empirical observation tells us
that the solutions with lower execution latency are those
with only one TP. This is explained by the fact that a new
TP produces an equal or worse effect than sharing FU
instances on the overall execution latency, because all the
nodes in that TP can only start executing after the end of
the execution of the TP immediately before.

When sharing of adders is not considered, the algorithm
is able to find 13 solutions with no worse execution
latency.

4.2 Exploiting the number of TPs 

An exploration of the overall execution latency versus
the number of TPs is shown in Figure 8. Those results
were produced by calling the algorithm several times, each
time starting with a different initial number of TPs from a
range of 1 to 15. The exploration took approximately 5.4 s
of CPU time. All the solutions use only a single TP, and
the best result (execution latency equal to 66 clock cycles)
was achieved when the algorithm started with 8 TPs. The
results without considering sharing of adders are shown in
Figure 9. The algorithm explored a range of TPs from 1 to
26, and the minimum execution latency achieved was 66
clock cycles (solution with 21 TPs). Based on those results
we can select a solution that minimizes the global execution
latency taking into account the reconfiguration times (see
equation (2)).




Initial number of Temporal Partitions 



Figure 8. Execution latency versus the ini- 
tial number of TPs for Mult4x4 obtained by 
the proposed algorithm, when Rmax=10 
(sharing of adders and multipliers).
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From the results presented so far we may conclude that 
sharing FUs can reduce the number of TPs without
increasing the overall execution time. Moreover, a minimum
number of TPs can be a priority, when an FPGA with sig- 
nificant reconfiguration times is used. Due to its low com- 
putational complexity, the algorithm can be used to exploit 
the design space based on the tradeoff between the number 
of TPs and the overall execution latency. 
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Figure 9. Execution latency and the final 
number of TPs versus the initial number of 
TPs obtained by the algorithm for Mult4x4,
when Rmax=10 (no sharing of adders). 



4.3 Comparison with other schedulers

At this point a question may occur: is the algorithm
competitive when a single TP is envisaged? Table IV
shows results for EWF and SEHWA⁷ considering various
sizes for the available resources ($R_{max}$). The schedules
obtained by the proposed algorithm considering only one TP
are shown (see the 5th column). The number of resources
used for each type of FU for each solution is also shown
(last column). "Fixed" refers to results collected from the
state-of-the-art schedulers [17][18][19], which represent
optimal (identified with *) or near-optimal scheduling
results (without taking temporal partitioning into account)
for the specified constraint on the number of FUs for each
type of operation (see the 2nd column). The results show
that our algorithm is efficient even when we are interested
in a final solution with a single TP.

The result labeled with a "*" is achieved without an
incremental update of the list of the nodes ready to be
mapped. This result shows that the algorithm did not escape
from a local minimum, since at least the result related to
$R_{max} = 15$ should be achieved. The first 4 results
obtained for SEHWA consider the increasing order of the
ASAP values as the second key (there is no evidence to
suggest when it is better to use the decreasing or the
increasing ASAP values as the second key).

The number of instances of each FU allocated by our
algorithm for each $R_{max}$ constraint differed in only two
cases from the constraints used (with total number of
resources equal to $R_{max}$) to produce the near-optimal
scheduling results (see Table IV). Therefore, it seems that
our algorithm can also be used for a fast identification of the
number of FU instances needed, considering a specific
maximum number of resources available on the device.

Table IV. Comparison of scheduling results 
obtained for EWF and SEHWA. 
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5 Related Work 

As far as we know, the development of temporal
partitioning algorithms was first considered in [9][2]. The
similarities between scheduling in high-level synthesis [8]
and temporal partitioning allow the use of common
scheduling schemes for partitioning. Some authors, such as
[9][10], have considered temporal partitioning at behavioral
levels having in mind the integration of synthesis.

In [9], a heuristic based on a static list scheduling algo- 
rithm, enhanced to consider temporal partitioning and par- 
tial reconfiguration, is shown. The approach exploits the 
dynamic reconfiguration capability of the devices, while 
doing temporal partitioning. 

In [10][20] the temporal partitioning problem is modeled
as a specific 0-1 non-linear programming (NLP) model. The
problem is transformed to integer linear programming (ILP)
and the solution determined by an ILP solver. Due to the
long execution times, this approach is not practical for large
input examples. Some heuristic methods have been
developed to permit its usability on larger input examples
[21]. Kaul [22] exploits the loop fission technique while
doing temporal partitioning in the presence of loops to
minimize the overall latency by utilization of the active TP
as long as possible. Sharing of functional units is considered
inside tasks and temporal partitioning is performed at the
task level. Design space exploration is performed by
inputting to the temporal partitioning algorithm different
design solutions for each task. Such solutions are generated
by a high-level synthesis tool (constraining the number of
FUs of each type). This approach lacks a global view and is
time-consuming.

The simplest approaches only consider temporal partitioning
without exploiting sharing of FUs. In [11], both a temporal
partitioning algorithm based on leveling the operations by an
ASAP scheme and another based on clustering a number of nodes
are used. The algorithm fills the available resources in
increasing order of the ASAP levels. The selection of nodes in
the same level is arbitrary, and the algorithm switches to
another TP when it encounters the first node that does not fit
on the current TP. The approach considers neither communication
costs nor resource sharing. In [23] another algorithm is
presented that selects the nodes to be mapped in a TP with two
different approaches (one for satisfying parallelism and
another for decreasing communication costs). In [12], an
algorithm based on an extension of the ASAP or ALAP leveling
schemes, which resorts to the mobility of each node to select
among the nodes, has been considered. [12] also shows an
algorithm that searches recursively in the list of ready nodes
so that if a node cannot be mapped to the current partition,
other nodes can be considered.
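
The ASAP-level filling scheme attributed to [11] above can be illustrated with a minimal sketch. It is not the implementation from [11]: the graph encoding, the per-node `size` table and the single scalar `capacity` are assumptions made only for this illustration, and ties within a level are broken by input order because the text calls the selection arbitrary.

```python
from collections import defaultdict

def asap_levels(nodes, edges):
    """Return the ASAP level of every node in a DAG (source nodes are level 0)."""
    preds = defaultdict(list)
    for src, dst in edges:
        preds[dst].append(src)
    level = {}

    def visit(n):
        if n not in level:
            level[n] = 1 + max((visit(p) for p in preds[n]), default=-1)
        return level[n]

    for n in nodes:
        visit(n)
    return level

def partition_by_asap(nodes, edges, size, capacity):
    """Fill TPs in increasing ASAP order; open a new TP at the first node that does not fit."""
    level = asap_levels(nodes, edges)
    order = sorted(nodes, key=lambda n: level[n])  # stable sort: ties keep input order
    partitions, current, used = [], [], 0
    for n in order:
        if size[n] > capacity:
            raise ValueError(f"node {n} does not fit on an empty device")
        if used + size[n] > capacity:
            partitions.append(current)
            current, used = [], 0
        current.append(n)
        used += size[n]
    if current:
        partitions.append(current)
    return partitions
```

As in the description, the sketch ignores communication costs between partitions and never shares a functional unit between nodes.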

[16] considers both the communication costs that can occur
among different TPs and the overall execution time. The authors
presented an extension to static list scheduling which makes
the algorithm sensitive to the communication costs while trying
to minimize the overall execution time. The results were
compared to near-optimal solutions obtained with a simulated
annealing algorithm tuned to do temporal partitioning while
minimizing an objective function that integrates the execution
time of the TPs and the communication costs. This comparison
revealed the efficiency of the approach.

[24] presents a method to do temporal partitioning considering
pipelining of the reconfiguration and execution stages. The
approach divides an FPGA into two portions to overlap the
execution of a TP in one portion (previously reconfigured) with
the reconfiguration of the other portion.

In [25] constraint logic programming is used to solve temporal
partitioning, scheduling, and dynamic module allocation.
However, the approach needs a specification of the number of
each FU before processing and may suffer from long runtimes.

More related to our approach is the algorithm presented in
[26]. A scheme based on the force-directed list scheduling
algorithm that considers resource sharing and temporal
partitioning is shown. The algorithm tries to minimize the
overall execution time, performing a tradeoff between the
number of TPs and sharing of FUs. However, the approach adapts
a scheduling algorithm not originally tailored to do temporal
partitioning and lacks a global view. Instead, our approach
proposes a novel algorithm matched to the combination of
temporal partitioning and sharing of FUs that maintains a
global view.

6 Conclusions and Future Work

In this paper we have presented a new and useful algo- 
rithm combining temporal partitioning, sharing of func- 
tional units, scheduling, allocation and binding. Unlike 
other approaches, this algorithm merges those tasks in a 
combined and global method. The obtained results, from a 
number of benchmarks, strongly confirm the efficiency 
and effectiveness of the idea. 

The low computation time achieved when dealing with the
presented examples shows that the algorithm is fast and
efficient and thus can be used on large examples.

The inclusion of functional units with pipeline stages and the
consideration of more than one implementation for a given
operation will be considered in the near future. Another
important issue is the overlapping of reconfiguration and
execution, which should be considered by future enhancements.
Finally, aspects related to conditional paths and loops will
also need to be the focus of future work.

1 Introduction

This document describes a method for compiling a subset of a high-level programming language (HLL)
like C or FORTRAN, extended by port access functions, to a reconfigurable data-flow processor (RDFP)
as described in Section 3. I.e., the program is transformed to one or several configurations of the RDFP.

This method can be used as part of an extended compiler for a hybrid architecture consisting of a standard
host processor and a reconfigurable data-flow coprocessor. The extended compiler handles a full HLL
like standard ANSI C. It maps suitable program parts like inner loops to the coprocessor and the rest of
the program to the host processor. However, this extended compiler is not the subject of this document.

2 Compilation Flow 

This section briefly describes the phases of the compilation method.

2.1 Frontend

The compiler uses a standard frontend which translates the input program (e.g. a C program) into an
internal format (IF) consisting of an abstract syntax tree (AST) and symbol tables. The frontend also
performs well-known compiler optimizations such as constant propagation, dead code elimination and
common subexpression elimination. For details, refer to any compiler construction textbook like [1]. E.g., the
SUIF compiler [2] can be used for this purpose.



2.2 Temporal Partitioning

Next, the program's IF representation is partitioned into sections which are executed sequentially on the 
RDFP by separate configurations. If the entire program can be executed by one configuration (fitting on 
the given RDFP), no temporal partitioning is necessary. This phase generates reconfiguration statements 
which load and remove the configurations sequentially according to the original program's control flow. 



2.3 Configuration Generation

Finally, the program sections determined by the temporal partitioning are mapped to RDFP configurations.
This phase generates program code or a data structure which is then used to directly program the
RDFP.



3 Configurable Objects and Functionality of a RDFP 

This section describes the configurable objects and functionality of an RDFP. A possible implementation
of the RDFP architecture is a PACT XPP™ Core. Here we only describe the minimum requirements for
an RDFP for this compilation method to work. The only data types considered are multi-bit words called
data and single-bit control signals called events. Data and events are always processed as packets, cf.
Section 3.2.

3.1 Configurable Objects and Functions

An RDFP consists of an array of configurable objects and a communication network. Each object can
be configured to perform certain functions (listed below). It performs the same function repeatedly until
the configuration is changed. The array need not be completely uniform, i.e. not all objects need to be
able to perform all functions. E.g., a RAM function can be implemented by a specialized RAM object
which cannot perform any other functions. It is also possible to combine several objects to a "macro" to
realize certain functions. Several RAM objects can, e.g., be combined to realize a RAM function with
larger storage.

After a configuration has been removed, all information is lost. Only the contents (values) of a RAM are
preserved during reconfiguration.

Figure 1: Functions of an RDFP

The following functions, which mainly handle data packets, can be configured in an RDFP. See Fig. 1 for a
graphical representation.

• ALU[opcode]: ALUs perform common arithmetical and logical operations on data. ALU functions
("opcodes") must be available for all operations used in the HLL. 1 ALU functions have two
data inputs A and B, and one data output X. Comparators have an event output U instead of the
data output. They produce a 1-event if the comparison is true, and a 0-event otherwise.

• CNT: A counter function which has data inputs LB, UB and INC (lower bound, upper bound
and increment) and data output X (counter value). A packet at event input START starts the
counter, and event input NEXT causes the generation of the next output value (and output events)
or causes the counter to terminate if UB is reached. If NEXT is not connected, the counter counts
continuously. The output events U, V, and W have the following functionality: For a counter
counting N times, N-1 event packets with value 0 (0-events) and one event packet with value
1 (1-event) are generated at output U. At output V, N 0-events are generated, and at output W,
N 0-events and one 1-event are created. The 1-event at W is only created after the counter has
terminated, i.e. a NEXT event packet was received after the last data packet was output.

• RAM[size]: The RAM function stores a fixed number of data words ("size"). It has a data input
RD and a data output OUT for reading at address RD. Event output ERD signals completion of
the read access. For a write access, data inputs WR and IN (address and value) and data output
OUT are used. Event output EWR signals completion of the write access. ERD and EWR always
generate 0-events. Note that external RAM can be handled as RAM functions exactly like internal
RAM.

• GATE: A GATE synchronizes a data packet at input A and an event packet at input E. When
both inputs have arrived, they are both consumed. The data packet is copied to output X, and the
event packet to output U.

• MUX: A MUX function has 2 data inputs A and B, an event input SEL, and a data output X. If
SEL receives a 0-packet, input A is copied to output X and input B discarded. For a 1-packet, B is
copied and A discarded.

• MERGE: A MERGE function has 2 data inputs A and B, an event input SEL, and a data output X.
If SEL receives a 0-packet, input A is copied to output X, but input B is not discarded. The packet
is left at the input B instead. For a 1-packet, B is copied and A left at the input.

• DEMUX: A DEMUX function has one data input A, an event input SEL, and two data outputs X
and Y. If SEL receives a 0-packet, input A is copied to output X, and no packet is created at output
Y. For a 1-packet, A is copied to Y, and no packet is created at output X.

• MDATA: A MDATA function replicates data packets. It has a data input A, an event input
SEL, and a data output X. If SEL receives a 1-packet, a data packet at A is consumed and copied
to output X. For all subsequent 0-packets at SEL, a copy of the input data packet is produced at
the output without consuming new packets at A. Only if another 1-packet arrives at SEL, the next
data packet at A is consumed and copied. 2

• INPORT[name]: Receives data packets from outside the RDFP through input port "name" and
copies them to data output X. If a packet was received, a 0-event is produced at event output U,
too. (Note that this function can only be configured at special objects connected to external busses.)

• OUTPORT[name]: Sends data packets received at data input A to the outside of the RDFP through
output port "name". If a packet was sent, a 0-event is produced at event output U, too. (Note that
this function can only be configured at special objects connected to external busses.)

1 Otherwise programs containing operations which do not have ALU opcodes in the RDFP must be excluded from the
supported HLL subset or substituted by "macros" of existing functions.

2 Note that this can be implemented by a MERGE with special properties on XPP.
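
The difference between the four packet-steering functions (MUX, MERGE, DEMUX and MDATA) is easiest to see on small packet queues. The following is a minimal behavioral sketch, not a model of the XPP hardware: inputs are represented as Python `deque` objects, each function runs until it would stall, and the stall of MDATA on a 0-event that arrives before any 1-event is an assumption, since the text does not cover that case.

```python
from collections import deque

def mux(a, b, sel):
    """MUX: consume one packet from A, B and SEL; forward one, discard the other."""
    out = []
    while a and b and sel:
        s, pa, pb = sel.popleft(), a.popleft(), b.popleft()
        out.append(pb if s else pa)
    return out

def merge(a, b, sel):
    """MERGE: forward the selected input; the packet at the other input stays queued."""
    out = []
    while sel and (b if sel[0] else a):
        src = b if sel.popleft() else a
        out.append(src.popleft())
    return out

def demux(a, sel):
    """DEMUX: route each packet at A to X (SEL = 0) or to Y (SEL = 1)."""
    x, y = [], []
    while a and sel:
        (y if sel.popleft() else x).append(a.popleft())
    return x, y

def mdata(a, sel):
    """MDATA: a 1-event loads a new packet from A, each 0-event repeats the held one."""
    out, held = [], None
    while sel:
        if sel[0] == 1:
            if not a:
                break            # stall until a data packet arrives at A
            held = a.popleft()
        elif held is None:
            break                # assumption: a 0-event before any 1-event stalls
        sel.popleft()
        out.append(held)
    return out
```

For the same inputs A = 1, 2 and B = 10, 20, a MUX driven by the events 0, 1 yields 1, 20 and loses two packets, whereas a MERGE driven by 0, 1, 1, 0 yields 1, 10, 20, 2 and loses none.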
Additionally, the following functions manipulate only event packets:

• 0-FILTER, 1-FILTER: A FILTER has an input E and an output U. A 0-FILTER copies a 0-event
from E to U, but 1-events at E are discarded. A 1-FILTER copies 1-events and discards 0-events.

• INVERTER: Copies all events from input E to output U but inverts their value.

• 0-CONSTANT, 1-CONSTANT: 0-CONSTANT copies all events from input E to output U, but
changes them all to value 0. 1-CONSTANT changes all to value 1.

• ECOMB: Combines two or more inputs E1, E2, E3, ..., producing a packet at output U. The output is
a 1-event iff one or more of the input packets are 1-events (logical or). A packet must be available
at all inputs before an output packet is produced. 3

• ESEQ[seq]: An ESEQ generates a sequence "seq" of events, e.g. "0001", at its output U. If it
has an input START, one entire sequence is generated for each event packet arriving at START. The
sequence is only repeated if the next event arrives at START. However, if START is not connected,
ESEQ constantly repeats the sequence.
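
The event functions above can be summarized in a short behavioral sketch that treats an event stream as a list of 0s and 1s. The function names and the `repeats` parameter are hypothetical: an unconnected ESEQ repeats forever, which the sketch approximates by a finite repeat count.

```python
def event_filter(kind, events):
    """0-FILTER (kind = 0) or 1-FILTER (kind = 1): pass matching events, drop the rest."""
    return [e for e in events if e == kind]

def inverter(events):
    """INVERTER: one output event per input event, with the value inverted."""
    return [1 - e for e in events]

def constant(value, events):
    """0-CONSTANT / 1-CONSTANT: one output event per input event, value forced."""
    return [value for _ in events]

def ecomb(*inputs):
    """ECOMB: wait for a packet at every input, then emit the logical OR of the group."""
    return [int(any(group)) for group in zip(*inputs)]

def eseq(seq, starts=None, repeats=1):
    """ESEQ: emit `seq` once per START event; without START, emit it `repeats` times."""
    pattern = [int(ch) for ch in seq]
    count = len(starts) if starts is not None else repeats
    return pattern * count
```

Because `zip` stops at the shortest input, `ecomb` produces an output only for positions at which every input has delivered a packet, which mirrors the synchronization requirement stated for ECOMB.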

3.2 Packet-based Communication Network

The communication network of an RDFP can connect an output of one object (i.e. its respective function)
to the input(s) of one or several other objects. This is usually achieved by busses and switches. By
placing the functions properly on the objects, many functions can be connected arbitrarily up to a limit
imposed by the device size. As mentioned above, all values are communicated as packets. A separate
communication network exists for data and event packets. The packets synchronize the functions in a
data-flow fashion, i.e., the function only executes when all input packets are available (apart from the
exceptions where not all inputs are required as described above). The function also stalls if the last output
packet has not been consumed. Therefore a data-flow graph mapped to an RDFP self-synchronizes its
execution without the need for external control. Only if two or more function outputs are connected to
the same function input (N to 1 connection), the self-synchronization is disabled. The user has to ensure
that only one packet arrives at a time. Otherwise a packet might get lost, and the value resulting from
combining two or more packets is undefined. Therefore this should be avoided. However, a function
output can be connected to many function inputs (1 to N connection) without problems.
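
The firing rule described above (execute only when all inputs hold a packet, stall while the previous output packet is unconsumed) can be sketched as follows. The class name and the single-slot representation of inputs and output are assumptions made for this illustration and say nothing about how an actual RDFP stores packets.

```python
class Function:
    """One configured function with single-packet input slots and one output slot."""

    def __init__(self, op, n_inputs):
        self.op = op
        self.inputs = [None] * n_inputs   # None means "no packet present"
        self.output = None

    def can_fire(self):
        # All input packets must be available and the last output must be consumed.
        return all(p is not None for p in self.inputs) and self.output is None

    def step(self):
        """Fire once if possible; return whether the function executed."""
        if not self.can_fire():
            return False
        self.output = self.op(*self.inputs)
        self.inputs = [None] * len(self.inputs)   # input packets are consumed
        return True
```

A consumer takes the packet by reading `output` and resetting it to `None`; until then the producer stalls, which is what lets a graph of such functions synchronize itself without external control.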

There are some special cases:

• A function input can be preloaded with a distinct value during configuration. This packet is
consumed like a normal packet coming from another object.

• A function input can be defined as constant. In this case, the packet at the input is reproduced
repeatedly for each function execution. It is even possible to connect an output of another function
to a constant input. In this case, the constant value is changed as soon as a new packet arrives at
the input. Note that there is no self-synchronization in this case, either. The function is not stalled
until the new packet arrives since the old packet is still used and reproduced.

3 Note that this function is implemented by the EAND operator on the XPP.

An RDFP requires register delays in the dataflow. Otherwise very long combinational delays and
asynchronous feedback are possible. We assume that delays are inserted at the inputs of some functions (like
for most ALUs) and in some routing segments of the communication network.

4 Temporal Partitioning 

The details of Temporal Partitioning need to be inserted from J. Cardoso's documents.

5 Configuration Generation 
5.1 Language Definition 

The following HLL features are not supported by the method described here:

• pointer operations

• library calls, operating system calls (including standard I/O functions)

• recursive function calls (Note that non-recursive function calls can be eliminated by function
inlining and therefore are not considered here.)

All scalar data types are converted to type integer. Integer values are equivalent to data packets in
the RDFP. Arrays (possibly multi-dimensional) are the only composite data types considered.

The following additional features are supported:

INPORTs and OUTPORTs can be accessed by the HLL functions getstream(name, value) and
putstream(name, value) respectively.

5.2 Mapping of High-Level Language Constructs

This method converts a HLL program to a control/data-flow graph (CDFG) consisting of the RDFP
functions defined in Section 3.1. Before the processing starts, all HLL program arrays are mapped to
RDFP RAM functions. An array x is mapped to RAM RAM(x). If several arrays are mapped to the
same RAM, an offset is assigned, too. The RAMs are added to an initially empty CDFG. There must be
enough RAMs of sufficient size for all program arrays.
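
The initial array mapping can be pictured with a small sketch. The text only requires that every array receives a RAM and, when RAMs are shared, an offset; the first-fit placement used here, the function name and the word-count inputs are assumptions chosen for illustration.

```python
def map_arrays_to_rams(arrays, ram_sizes):
    """First-fit mapping of arrays to RAMs; returns {array: (ram_index, offset)}.

    arrays maps an array name to its size in words, ram_sizes lists the RAM capacities.
    """
    used = [0] * len(ram_sizes)
    mapping = {}
    for name, words in arrays.items():
        for ram, size in enumerate(ram_sizes):
            if used[ram] + words <= size:
                mapping[name] = (ram, used[ram])   # the offset is the next free word
                used[ram] += words
                break
        else:
            raise ValueError(f"no RAM large enough for array {name}")
    return mapping
```

The offset recorded here is the value that later has to be added to every computed address of the array when it shares a RAM with other arrays.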

The CDFG is generated by a traversal of the AST of the HLL program. The following two pieces of 
information are maintained at every program point 4 during the traversal: 

4 In a program, program points are between two statements or before the beginning or after the end of a program structure
like a loop or a conditional statement.



2002-1-18 xppvcpsLt V0.1 Confidential 



A Method for Compiling High-Level Language Programs to a Reconfigurable Data-Flow Processor



• START points to an event output of an object. It delivers a 0-event whenever the program execution
at this program point starts. At the beginning, a 0-CONSTANT preloaded with an event input is
added to the CDFG. (It delivers a 0-event immediately after configuration.) START initially points
to its output. The STOP signal generated after a program part has finished executing is used as
new START signal for the next program part or signals termination of the entire program.

• VARLIST is a list of {variable, object-output} pairs. The pairs map integer variables (no arrays)
to a CDFG object's output. The first pair for a variable in VARLIST contains the output of the
object which produces the value of this variable valid at the program point. New pairs are always
added to the front of VARLIST. The expression VARDEF(var) refers to the object-output of
the first pair with variable var in VARLIST. 5

The following subsections systematically list all HLL components and describe how they are processed,
thereby altering the CDFG, START and VARLIST.



5.2.1 Integer Expressions and Assignments

Straight-line code without array accesses can be directly mapped to a data-flow graph. One ALU is
allocated for each operator in the program. Because of the self-synchronization of the ALUs, no explicit
control or scheduling is needed. Therefore processing these assignments does not access or alter START.
The data dependences (as they would be exposed in the DAG representation of the program [1]) are
analyzed through the processing of VARLIST. These assignments synchronize themselves through the
data-flow. The data-driven execution automatically exploits the available instruction level parallelism.

All assignments evaluate the right-hand side (RHS) or source expression. This evaluation results in a
pointer to a CDFG object's output (or pseudo-object as defined below). For integer assignments, the
left-hand side (LHS) variable or destination is combined with the RHS result object to form a new pair
{LHS, result(RHS)} which is added to the front of VARLIST.

The simplest statement is a constant assigned to an integer. 6

a = 5;

It doesn't change the CDFG, but adds {a, 5} to the front of VARLIST. The constant 5 is a
"pseudo-object" which only holds the value, but does not refer to a CDFG object. Now VARDEF(a) equals 5 at
subsequent program points before a is redefined.

Integer assignments can also combine variables already defined and constants:

b = a * 2 + 3;

In the AST, the RHS is already converted to an expression tree. This tree is transformed to a combination
of old and new CDFG objects (which are added to the CDFG) as follows: Each operator (internal node)
of the tree is substituted by an ALU with the opcode corresponding to the operator in the tree. If a leaf
node is a constant, the ALU's input is directly connected to that constant. If a leaf node is an integer
variable var, it is looked up in VARLIST, i.e. VARDEF(var) is retrieved. Then VARDEF(var) (an output
of an already existing object in CDFG or a constant) is connected to the ALU's input. The output of the
ALU corresponding to the root operator in the expression tree is defined as the result of the RHS. Finally,
a new pair {LHS, result(RHS)} is added to VARLIST. If the two assignments above are processed, the
CDFG with two ALUs in Fig. 2 is created. 7 Outputs occurring in VARLIST are labeled by Roman
numerals. After these two assignments, VARLIST = [{b, I}, {a, 5}]. (The front of the list is on the left
side.) Note that all inputs connected to a constant (whether directly from the expression tree or retrieved
from VARLIST) must be defined as constant. Inputs defined as constants have a small c next to the input
arrow in Fig. 2.

5 This method of using a VARLIST is adapted from the Transmogrifier C compiler [3].
6 Note that we use C syntax for the following examples.
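
The processing of integer assignments can be sketched in a few lines. This is an illustration under assumptions, not the compiler's code: expression trees are encoded as nested `(opcode, left, right)` tuples, variables as strings, constants as Python integers, and ALU outputs are named `ALU1`, `ALU2`, ... instead of the Roman numerals used in Fig. 2.

```python
import itertools

class CDFG:
    """Holds the ALUs created so far as (label, opcode, left_input, right_input)."""

    def __init__(self):
        self.alus = []
        self._ids = itertools.count(1)

    def add_alu(self, opcode, left, right):
        label = f"ALU{next(self._ids)}"
        self.alus.append((label, opcode, left, right))
        return label

def vardef(varlist, var):
    """VARDEF(var): the object-output of the first pair for var in VARLIST."""
    for name, output in varlist:
        if name == var:
            return output
    raise KeyError(var)

def eval_rhs(tree, cdfg, varlist):
    """Return the output (or constant pseudo-object) that carries the value of tree."""
    if isinstance(tree, int):        # constant leaf: pseudo-object, no CDFG change
        return tree
    if isinstance(tree, str):        # variable leaf: look up the current definition
        return vardef(varlist, tree)
    opcode, left, right = tree       # operator node: one new ALU
    return cdfg.add_alu(opcode, eval_rhs(left, cdfg, varlist), eval_rhs(right, cdfg, varlist))

def assign(lhs, tree, cdfg, varlist):
    """Process lhs = tree: new pairs always go to the front of VARLIST."""
    varlist.insert(0, (lhs, eval_rhs(tree, cdfg, varlist)))
```

Processing `a = 5` and then `b = a * 2 + 3` with this sketch creates two ALUs and leaves the pair for b in front of the pair for a, in the same order as the VARLIST given in the text.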

5.2.2 Conditional Integer Assignments 

For conditional if-then-else statements containing only integer assignments, objects for condition
evaluation are created first. The object event output indicating the condition result is kept for choosing
the correct branch result later. Next, both branches are processed in parallel, using separate copies
VARLIST1 and VARLIST2 of VARLIST. (VARLIST itself is not changed.) Finally, for all variables
added to VARLIST1 or VARLIST2, a new entry for VARLIST is created (combination phase). The valid
definitions from VARLIST1 and VARLIST2 are combined with a MUX function, and the correct input
is selected by the condition result. For variables only defined in one of the two branches, the multiplexer
uses the result retrieved from the original VARLIST for the other branch. If the original VARLIST does
not have an entry for this variable, a special "undefined" constant value is used. However, in a
functionally correct program this value will never be used. As an optimization, only variables live [1] after the
if-then-else structure need to be added to VARLIST in the combination phase.

Consider the following example: 

i = 7;
a = 3;
if (i < 10) {
    a = 5;
    c = 7;
}
else {
    c = a - 1;
    d = 0;
}

Fig. 3 shows the resulting CDFG. Before the if-then-else construct, VARLIST = [{a, 3}, {i, 7}]. After
processing the branches, for the then branch, VARLIST1 = [{c, 7}, {a, 5}, {a, 3}, {i, 7}], and for the
else branch, VARLIST2 = [{d, 0}, {c, I}, {a, 3}, {i, 7}]. After combination, VARLIST = [{d, II}, {c,
III}, {a, IV}, {a, 3}, {i, 7}].

Note that case- or switch-statements can be processed, too, since they can - without loss of generality - 
be converted to nested if-then-else statements. 

This processing of conditional statements doesn't need explicit control, either. Both branches are
executed in parallel and synchronized by the data-flow.
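
The combination phase can be sketched as follows. The tuple encoding of a MUX, the string "undefined" for the special constant, and the alphabetical processing order are assumptions of this illustration. Since a comparator emits a 1-event when the condition is true and a MUX copies input B for a 1-packet, the then-branch value is placed at B and the else-branch value at A.

```python
def combine_branches(cond, varlist, then_list, else_list):
    """Return the VARLIST after an if-then-else: one MUX per variable defined in a branch.

    then_list and else_list are the branch copies (VARLIST1, VARLIST2); each consists of
    the pairs added in the branch followed by the unchanged pairs of varlist.
    """
    def lookup(vl, var):
        # First definition of var, or the special "undefined" constant.
        return next((out for name, out in vl if name == var), "undefined")

    base = len(varlist)
    new_then = [name for name, _ in then_list[:len(then_list) - base]]
    new_else = [name for name, _ in else_list[:len(else_list) - base]]
    result = list(varlist)
    for var in sorted(set(new_then) | set(new_else)):
        # ("MUX", SEL, A, B): SEL = 0 selects A (else branch), SEL = 1 selects B (then branch)
        mux = ("MUX", cond, lookup(else_list, var), lookup(then_list, var))
        result.insert(0, (var, mux))
    return result
```

Applied to the example above, the sketch puts MUX entries for d, c and a in front of the original pairs {a, 3} and {i, 7}. The else input of the MUX for a falls back to the original definition 3, and the then input of the MUX for d is the undefined constant.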

7 Note that the input and output names can be deduced from their position, cf. Fig. 1. Also note that the compiler
frontend would normally have substituted the second assignment by b = 13 (constant propagation). For the simplicity of this
explanation, no frontend optimizations are considered in this and the following examples.

5.2.3 Array Accesses

In contrast to the above sections, array accesses have to be controlled explicitly to maintain the correct
execution order. For a read access, the read address is connected to data input RD. For a write access,
the write address is connected to data input WR and the write value to input IN. All these inputs are
connected to their respective sources through a GATE controlled by START. A STOP event signalling
completion of the array access must be assigned to the START variable. Since there's only one START
event packet available, only one array access can occur at a time, and the execution order of the original
program is maintained. This scheduling scheme is similar to a one-hot controller for digital hardware.

If a RAM is read and written at only one program point, the ERD or EWR outputs can be used as STOP
events. However, if several read or several write accesses (from different program points) to the same
RAM occur, each access produces an ERD or EWR event, respectively. But a STOP event should only be
generated for the program point currently executed, i.e. the current access. This is achieved by connecting
the START signals (i.e. those connected to the GATEs) of all other accesses with the inverted START
signal of the current access. The resulting signal produces an event for every access, but only for the
current access a 1-event. This event is combined (ECOMB) with the RAM's ERD or EWR output. The
ECOMB's output will only occur after the access is completed. Because ECOMB OR-combines its
event packets, only the current access produces a 1-event. Next, this event is filtered with a 1-FILTER
and changed by a 0-CONSTANT, resulting in a STOP signal which produces a 0-event only after the
current access is completed, as required. See below for an example.
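
The event chain that derives a STOP signal for one access out of several accesses to the same RAM can be traced step by step. The sketch below is a behavioral illustration only: accesses are identified by hypothetical program-point labels, and `access_trace` lists which program point accesses the RAM, in execution order.

```python
def stop_events(access_trace, me):
    """Return the STOP events produced for access `me`, one per completed own access."""
    stops = []
    for current in access_trace:
        start = 0                                          # START always delivers a 0-event
        marker = (1 - start) if current == me else start   # own START inverted, others as is
        erd = 0                                            # ERD and EWR always generate 0-events
        combined = marker | erd                            # ECOMB: OR, available after completion
        if combined == 1:                                  # 1-FILTER discards the 0-events
            stops.append(0)                                # 0-CONSTANT turns the 1-event into a 0-event
    return stops
```

For the trace p1, p2, p1, p3 the access at p1 sees two STOP events and the access at p3 sees one, although the RAM raised ERD or EWR four times.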

For computing the RAM addresses, the compiler frontend's standard transformation for array accesses 
can be used. The only difference is that the offset with respect to the RDFP RAM (as determined in the 
initial mapping phase) must be used. 

For several accesses, several sources can be connected to the RD, WR and IN inputs of a RAM. This
disables the self-synchronization. However, since only one access at a time can happen, the GATEs only
allow one data packet to arrive at the inputs.

For read accesses, the packets at the OUT output face the same problem as the ERD event packets: They
occur for every read access, but must only be used (and forwarded to subsequent operators) for the current
access. This can be achieved by connecting the OUT output via a DEMUX function. The Y output of
the DEMUX is used, and the X output is left unconnected. Then it acts as a selective gate which only
forwards packets if its SEL input receives a 1-event, and discards its data input if SEL receives a 0-event.
The signal created by the ECOMB described above for the STOP signal creates a 1-event for the current
access, and a 0-event otherwise. Using it as the SEL input achieves exactly the desired functionality.

To avoid redundant read accesses, RAM reads are also registered in VARLIST. Instead of an integer
variable, an array element is used as first element of the pair. However, a change in a variable occurring
in an array index invalidates the information in VARLIST. It must then be removed from it.
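
Registering array reads in VARLIST and invalidating them can be sketched as below. The key format, a tuple of the array name and the index variables used, is an assumption for illustration; the text only states that an array element takes the place of the variable in the pair.

```python
def register_read(varlist, array, index_vars, output):
    """Record an array element read; the key is (array, index variables used)."""
    varlist.insert(0, ((array, tuple(index_vars)), output))

def invalidate_reads(varlist, changed_var):
    """Remove recorded array reads whose index expression uses changed_var."""
    varlist[:] = [
        (key, out) for key, out in varlist
        if not (isinstance(key, tuple) and changed_var in key[1])
    ]
```

Pairs for plain integer variables have string keys in this sketch and are therefore never removed by `invalidate_reads`.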

The following example shows two read accesses: 

x = a[i];
y = a[j];
z = x + y;

Fig. 4 shows the resulting CDFG. Inputs START (old), i and j should be substituted by the actual
functions resulting from the program before the array reads. The signal indicating the STOP of the first access
is marked by STOP1. Write accesses use the same control events, but instead of one GATE per access
for the RD inputs, one GATE for WR and one GATE for IN (with the same E input) are used. Also no
outputs need to be handled.

Fig. 5 shows the access a[i] = x; for the simple case that the RAM is only written once, i.e. at one
program point.

This scheme executes RAM accesses correctly, but not very fast since all accesses are synchronized even
if this is not necessary. The following optimizations are possible:

• Only accesses to the same RAM are synchronized. Accesses to different arrays can occur
concurrently or even in changed order. When there is a data dependency, the accesses self-synchronize
automatically. This can be achieved by maintaining a separate START signal for every RAM. At
the end of a basic block [1], all these START signals must be combined by an ECOMB to provide
a new signal for the next basic block.

• For sequences of either read accesses or write accesses (not mixed) within a basic block, it is
possible to stream data into the RAM rather than waiting for the previous access to complete. For
this purpose, a combination of MERGE functions selects the RD or WR and IN inputs in the order
dictated by the sequence. The MERGEs must be controlled by iterative ESEQs guaranteeing that
the inputs are only forwarded in this order. Then only the first access in the sequence needs to
be controlled by GATEs; the other GATEs can be removed to increase throughput. Similarly, the
OUT outputs of a read access can be distributed more efficiently for a sequence. A combination
of DEMUX functions with the same ESEQ control can be used. For read accesses, the generation
of the last output can be sent through a GATE (without the E input connected), thereby producing
a STOP event.

Fig. 6 shows the following three array reads in the optimized fashion.

x = a[i];
y = a[j];
z = a[k];
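
How a combination of MERGE functions under ESEQ control can order three read addresses may be sketched as follows. The particular arrangement (two cascaded MERGEs with the hypothetical sequences "01" and "001") is an assumption for illustration and is not taken from Fig. 6.

```python
from collections import deque

def merge(a, b, sel):
    """MERGE: forward the input chosen by SEL; the other packet stays at its input."""
    out = deque()
    while sel and (b if sel[0] else a):
        out.append((b if sel.popleft() else a).popleft())
    return out

def stream_addresses(i, j, k):
    """Forward the read addresses to RD strictly in the order i, j, k."""
    first = merge(deque([i]), deque([j]), deque([0, 1]))       # ESEQ "01": i, then j
    return list(merge(first, deque([k]), deque([0, 0, 1])))    # ESEQ "001": i, j, then k
```

Because a MERGE leaves the unselected packet at its input, the address k may arrive early without being lost or overtaking the others.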



5.2.4 Input and Output Ports 

Input and output ports are processed similarly to vector accesses. A read from an input port is like an
array read without an address. The input data packet is sent to DEMUX functions which send it to the
correct subsequent operators. The STOP signal is generated in the same way as described above for
RAM accesses by combining the INPORT's U output with the current and other START signals.

Output ports control the data packets by GATEs like array write accesses. The STOP signal is also
created as for RAM accesses.



5.2.5 General Conditional Statements

Conditional statements containing either array accesses or inner loops cannot be processed as described
in Section 5.2.2. Data packets must only be sent to the active branch. Therefore, a dataflow analysis is
[...] Since only one branch is ever activated there [...]



5.2.6 FOR Loops



A FOR loop is controlled by a counter CNT. The lower bound (LB), upper bound (UB), and increment
[...] (see Sections 5.2.1 and 5.2.3) [...] input is connected to the START signal. The new START signal
(after [...]) is [...] W output sent through a 1-FILTER and 0-CONSTANT. (W only outputs a 1-event
[...]) [...] V output produces one 0-event for each loop iteration and [...] START [...] STOP signal.)
This assures that one iteration only starts after [...]

[...] at its beginning [...] a combination of the input value (from [...]) [...] feedback values [...]
controlled by [...] output. It sends the input or [...] used in the loop [...] feedback value is sent to
the output of the loop [...] not defined in the [...] DEMUX [...] Data packets from variables defined
[...] signal as explained [...] creation of a feedback [...] consumed in each loop [...] available
(unless it is a constant), but it is [...] Thus it is necessary to [...] second iteration onwards [...]
function with the SEL input [...]

These methods allow arbitrarily nested loops and conditional statements to be processed.
Fig. 7 shows the generated CDFG for the following for loop.



a = b + c;
for (i=0; i<=10; i++) {
    a = a + i;
    x[i] = k;
}



[...] accesses since packet generation is controlled by the counter anyway.
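
The packets a CNT function emits while controlling such a loop follow directly from its definition in Section 3.1 and can be tabulated with a short sketch. Treating UB as inclusive (matching `i<=10` in the example) and restricting the sketch to upward-counting, non-empty loops are assumptions; the function name is hypothetical.

```python
def cnt_outputs(lb, ub, inc):
    """Return (X, U, V, W) packet lists of a CNT that runs until it terminates."""
    if inc <= 0 or lb > ub:
        raise ValueError("sketch covers only upward-counting, non-empty loops")
    x = list(range(lb, ub + 1, inc))   # assumption: the counter value UB is still output
    n = len(x)
    u = [0] * (n - 1) + [1]            # N-1 0-events, then one 1-event
    v = [0] * n                        # N 0-events
    w = [0] * n + [1]                  # N 0-events, 1-event only after termination
    return x, u, v, w
```

For the loop above (LB = 0, UB = 10, INC = 1) the counter counts N = 11 times, so W delivers eleven 0-events followed by the single 1-event that marks termination.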
5.2.7 WHILE Loops

WHILE loops are processed similarly. The STOP signal (new START signal) is generated from the loop
condition, fed through a 0-FILTER. When the loop finishes, an additional signal (similar to the CNT's
W output) must be generated which controls the DEMUXes to generate an output.

5.2.8 Parallelization, Vectorization and Pipelining

The method described so far generates CDFGs performing the HLL program's functionality on an RDFP.
However, the program execution is unnecessarily sequentialized by the START signals. In many cases
this is too restrictive. Several optimizations are possible.

Independent loops (operating on different variables and arrays) need not be sequentialized. They can
use the same START signal, and operate independently. After execution, their STOP signals must be
combined by ECOMB, forming a new START signal for the subsequent program parts.

In some cases, loops can be vectorized. This means that loop iterations can overlap, leading to a pipelined
data-flow through the operators of the loop body [4]. This technique can be easily applied to the method
described here. For FOR loops, the CNT's NEXT input is removed so that CNT counts continuously,
thereby overlapping the loop iterations. Since vectorizable loops have no memory access conflicts, the
read and write accesses to the same RAM can also overlap. Especially for dual-ported RAM this leads
to considerable performance improvements. In this case separate START signals must not only be
maintained for each RAM, but also separately for read and write accesses.
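
The benefit of overlapping iterations can be estimated with a deliberately idealized model. Everything in it is an assumption for illustration: the loop body has a fixed latency in cycles, a vectorized loop accepts one new iteration per cycle, and reconfiguration and memory stalls are ignored. The source gives no such figures.

```python
def loop_cycles(iterations, body_latency, vectorized):
    """Idealized cycle count of a loop whose body takes body_latency cycles."""
    if iterations == 0:
        return 0
    if vectorized:
        # Pipeline fill once, then one result per cycle.
        return body_latency + iterations - 1
    # Sequentialized by START/STOP: each iteration waits for the previous one.
    return iterations * body_latency
```

With 11 iterations and a hypothetical body latency of 4 cycles, the sequential form needs 44 cycles and the vectorized form 14.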

Finally, loop transformations like loop unrolling, loop distribution, loop tiling or loop merging [4] can
be applied to increase the parallelism and improve performance.